nvidia/APEX fails to install

0num4 commented 10 months ago

It doesn't work unless the relevant section of post-create.sh is commented out:

# # Install Apex.
# pushd /workspaces
# git clone 'https://github.com/NVIDIA/apex.git'
# pushd apex
# MAX_JOBS=$(nproc) python3 -m pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
# popd
# rm -rf apex
# popd

0num4 commented 10 months ago

Trying a quick manual install of apex, it fails like this:

vscode ➜ /tmp $ git clone 'https://github.com/NVIDIA/apex.git'
Cloning into 'apex'...
remote: Enumerating objects: 11521, done.
remote: Counting objects: 100% (3589/3589), done.
remote: Compressing objects: 100% (502/502), done.
remote: Total 11521 (delta 3261), reused 3194 (delta 3084), pack-reused 7932
Receiving objects: 100% (11521/11521), 15.43 MiB | 7.10 MiB/s, done.
Resolving deltas: 100% (8090/8090), done.
vscode ➜ /tmp $ cd apex/
vscode ➜ /tmp/apex (master) $ pip install .
Processing /tmp/apex
  Installing build dependencies ... done
  Getting requirements to build wheel ... error
  error: subprocess-exited-with-error

  × Getting requirements to build wheel did not run successfully.
  │ exit code: 1
  ╰─> [18 lines of output]
      Traceback (most recent call last):
        File "/opt/conda/lib/python3.11/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 353, in <module>
          main()
        File "/opt/conda/lib/python3.11/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 335, in main
          json_out['return_val'] = hook(**hook_input['kwargs'])
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/opt/conda/lib/python3.11/site-packages/pip/_vendor/pyproject_hooks/_in_process/_in_process.py", line 118, in get_requires_for_build_wheel
          return hook(config_settings)
                 ^^^^^^^^^^^^^^^^^^^^^
        File "/tmp/pip-build-env-ss6nazen/overlay/lib/python3.11/site-packages/setuptools/build_meta.py", line 325, in get_requires_for_build_wheel
          return self._get_build_requires(config_settings, requirements=['wheel'])
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/tmp/pip-build-env-ss6nazen/overlay/lib/python3.11/site-packages/setuptools/build_meta.py", line 295, in _get_build_requires
          self.run_setup()
        File "/tmp/pip-build-env-ss6nazen/overlay/lib/python3.11/site-packages/setuptools/build_meta.py", line 311, in run_setup
          exec(code, locals())
        File "<string>", line 5, in <module>
      ModuleNotFoundError: No module named 'packaging'
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error

× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip.
vscode ➜ /tmp/apex (master) $
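
The traceback ends in ModuleNotFoundError: No module named 'packaging', raised while pip runs apex's setup.py inside its isolated build environment. A minimal sketch of a pre-check, assuming the fix is to have the modules that setup.py imports available in the active environment and to build with --no-build-isolation (the flag used later in this thread). has_module and missing_modules are hypothetical helper names, not part of pip or apex:

```shell
# Hypothetical helper: succeed if python3 can import the given module.
has_module() {
  python3 -c "import $1" >/dev/null 2>&1
}

# Print every module from the argument list that cannot be imported.
# Prints nothing when all of them are available.
missing_modules() {
  for mod in "$@"; do
    has_module "$mod" || printf '%s\n' "$mod"
  done
}
```

Running `missing_modules packaging torch` before the build should list whatever still needs to be installed (packaging is taken from the traceback; torch is an assumption based on apex requiring PyTorch).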

0num4 commented 10 months ago

Using apex requires CUDA and PyTorch. https://github.com/NVIDIA/apex https://trend-tracer.com/nvidia-apex-install/

0num4 commented 10 months ago

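According to the error message, the root cause is a mismatch between the CUDA version that PyTorch was compiled with and the CUDA version currently in use. The PyTorch binaries were compiled with CUDA 11.8, but the environment you are installing into uses CUDA 11.7.

That is the explanation I got here: https://chat.openai.com/share/53feef6c-c50d-4d13-ab05-ab22c8a51500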
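
The mismatch described above can also be checked by hand. A rough sketch, assuming nvcc is on PATH and that `nvcc --version` prints a line containing `release X.Y`. nvcc_release and cuda_versions_match are hypothetical helper names:

```shell
# Extract "major.minor" from `nvcc --version` output read on stdin.
nvcc_release() {
  sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p'
}

# Succeed only when both version strings are non-empty and identical,
# e.g. "11.8" and "11.8".
cuda_versions_match() {
  [ -n "$1" ] && [ "$1" = "$2" ]
}

# Run the real comparison only when nvcc and torch are both available.
if command -v nvcc >/dev/null 2>&1 && python3 -c 'import torch' >/dev/null 2>&1; then
  toolkit=$(nvcc --version | nvcc_release)
  built_with=$(python3 -c 'import torch; print(torch.version.cuda)')
  if cuda_versions_match "$toolkit" "$built_with"; then
    echo "CUDA versions match: $toolkit"
  else
    echo "CUDA mismatch: toolkit $toolkit, PyTorch built with $built_with"
  fi
fi
```

This uses a strict string comparison. Whether apex tolerates a minor-version difference depends on the apex revision, so treat a reported mismatch as a hint rather than a verdict.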

0num4 commented 10 months ago

After switching to CUDA 11.8 and installing a matching PyTorch, it worked (but it is extremely heavy).

pushd /workspaces
wget 'https://developer.download.nvidia.com/compute/cuda/repos/wsl-ubuntu/x86_64/cuda-wsl-ubuntu.pin'
sudo mv cuda-wsl-ubuntu.pin /etc/apt/preferences.d/cuda-repository-pin-600
# wget 'https://developer.download.nvidia.com/compute/cuda/11.7.1/local_installers/cuda-repo-wsl-ubuntu-11-7-local_11.7.1-1_amd64.deb'
# wget 'https://developer.download.nvidia.com/compute/cuda/12.3.1/local_installers/cuda-repo-wsl-ubuntu-12-3-local_12.3.1-1_amd64.deb'
wget 'https://developer.download.nvidia.com/compute/cuda/11.8.0/local_installers/cuda-repo-wsl-ubuntu-11-8-local_11.8.0-1_amd64.deb'

# sudo dpkg -i cuda-repo-wsl-ubuntu-11-7-local_11.7.1-1_amd64.deb
# sudo dpkg -i cuda-repo-wsl-ubuntu-12-3-local_12.3.1-1_amd64.deb
sudo dpkg -i cuda-repo-wsl-ubuntu-11-8-local_11.8.0-1_amd64.deb

# rm -f cuda-repo-wsl-ubuntu-11-7-local_11.7.1-1_amd64.deb
# rm -f cuda-repo-wsl-ubuntu-12-3-local_12.3.1-1_amd64.deb
rm -f cuda-repo-wsl-ubuntu-11-8-local_11.8.0-1_amd64.deb
popd
# sudo cp /var/cuda-repo-wsl-ubuntu-11-7-local/cuda-*-keyring.gpg /usr/share/keyrings/
# cp /var/cuda-repo-wsl-ubuntu-12-3-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo cp /var/cuda-repo-wsl-ubuntu-11-8-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get -y install cuda
# Install CUDA END

conda update -y python
conda update -y -n base conda
conda install -y numpy
conda install -y pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia

pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./

0num4 commented 9 months ago

RuntimeError: Cuda extensions are being compiled with a version of Cuda that does not match the version used to compile Pytorch binaries. Pytorch binaries were compiled with Cuda 12.1.

It looks like it would be better to pin the apex version.

0num4 commented 9 months ago

Using CUDA toolkit 11.7 and downgrading PyTorch to a version built for cu117, the build went through.

https://github.com/NVIDIA/apex/issues/1735#issuecomment-1751917444

git checkout 2386a912164b0c5cfcd8be7a2b890fbac5607c82
cd ../kanachan-wsl/
python pytorch-version-print.py
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu117
cd -
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./
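
The downgrade above pairs CUDA toolkit 11.7 with the cu117 PyTorch wheels. A small sketch of deriving that index URL from the toolkit version so the two stay in sync. torch_index_url is a hypothetical helper, and it assumes the URL pattern from the pip3 command above also holds for the version passed in (PyTorch does not publish wheels for every CUDA release):

```shell
# Hypothetical helper: map a CUDA version such as "11.7" or "11.7.1"
# to the matching PyTorch wheel index URL (".../whl/cu117").
torch_index_url() {
  # Keep only major.minor, then drop the dot: "11.7.1" -> "11.7" -> "117".
  tag=$(printf '%s' "$1" | cut -d. -f1,2 | tr -d '.')
  printf 'https://download.pytorch.org/whl/cu%s\n' "$tag"
}
```

It could then be used as `pip3 install torch torchvision torchaudio --index-url "$(torch_index_url 11.7)"`.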

0num4 / kanachan

nvidia/APEX fails to install #2