While I don't think this is a bug in the apex code itself, I do think it's a deficiency in the documentation.
Having counted the number of people in the Issues tab, on Stack Overflow, etc. whose problems were due to a CUDA version mismatch, I strongly believe some additional installation instructions would be very helpful. A first glance at the README by a novice (yeah, I'm a CUDA novice, hi, nice to meet you) does not impart the impression that this package is even remotely reliant on matching CUDA versions with Pytorch. (My first reaction was "Uh, okay, but I thought CUDA was backwards-compatible?")
Describe the Bug
Minimal Steps/Code to Reproduce the Bug
mamba install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
git clone https://github.com/NVIDIA/apex.git
cd apex
# if pip >= 23.1 (ref: https://pip.pypa.io/en/stable/news/#v23-1) which supports multiple `--config-settings` with the same key...
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./
# otherwise
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --global-option="--cpp_ext" --global-option="--cuda_ext" ./
...aaand the installation crashes.
Expected Behavior
You'd read the README and expect this installation to work because, well, there is no mention of a version-specific dependency anywhere. Of course, this might be common sense to the seasoned developer, but isn't the point of having installation instructions to make life easier for everyone, not only the experts?
Suggested Action
Clearly indicate in README.md that the CUDA version needs to match the version used to compile Pytorch.
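For context on why the build crashes: the failure happens when the CUDA version torch was compiled with differs from the one nvcc reports. A minimal sketch of that kind of major/minor comparison (the function name is mine, not apex's):

```python
# Hypothetical sketch of the version comparison that makes a build abort.
def cuda_versions_match(torch_cuda: str, nvcc_cuda: str) -> bool:
    """Compare only major.minor, the granularity a build script cares about."""
    return torch_cuda.split(".")[:2] == nvcc_cuda.split(".")[:2]

print(cuda_versions_match("11.8", "12.1"))   # → False: this mismatch kills the build
print(cuda_versions_match("12.1.66", "12.1"))  # → True: patch versions don't matter
```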
Include some suggestions for installing apex without sudo rights, i.e. without switching out the systemwide CUDA and potentially breaking everything else.
The second item doesn't really have to do with any bug or deficiency, but it'd be handy to have these directions around. (I outline a method below.)
Environment
Below, I explain how I installed apex without installing a systemwide CUDA or pulling any containers. I am working on WSL 2; the distro is Ubuntu 22.04. I will assume you have a conda/mamba venv with Pytorch already installed, since you'd already be using torch if you decided to replace your AdamW with apex's, for instance.
Also, I did not install ninja, but I don't think that matters.
python -m torch.utils.collect_env for your reference:
Bonus - My method of installing apex
Check the CUDA version with which Pytorch was built.
Did you know that you can just install CUDA on conda as well? I only learned of this today; I was somehow under the misconception that only the toolkit was available.
Be sure to match the CUDA version with the version torch was built with.
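To find the version torch was built with, `torch.version.cuda` reports it directly (it is None on CPU-only builds):

```python
import torch

# The CUDA version this torch build was compiled against, e.g. "11.8" or "12.1".
# On a CPU-only build this is None, and apex's CUDA extensions won't build at all.
print(torch.version.cuda)
```

Compare this against `nvcc --version` inside the activated venv; the two should agree.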
Navigate to where the libraries are for your venv. In my case, this was /home/stetstet/mambaforge/envs/apex/lib. Make sure that none of the relevant libraries have broken soft-links. In my case, libcudart.so had a broken link, which caused compilation to fail halfway through (ugh). Easily fixed by re-pointing the link at the versioned library file.
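To illustrate the broken-link failure mode and the fix in a self-contained way (using a temp directory rather than the real venv, and a made-up versioned filename libcudart.so.12; check `ls -l` for the actual name on your system):

```python
import os
import tempfile

libdir = tempfile.mkdtemp()                     # stands in for the venv's lib/
real = os.path.join(libdir, "libcudart.so.12")  # the versioned library file
link = os.path.join(libdir, "libcudart.so")     # the soft-link the build resolves
open(real, "w").close()

os.symlink("/nonexistent", link)   # a broken link, like the one I hit
assert not os.path.exists(link)    # os.path.exists follows links: it's broken

os.remove(link)                    # the fix: remove and re-point the link
os.symlink("libcudart.so.12", link)
assert os.path.exists(link)        # now resolves to the real file
print(os.readlink(link))           # → libcudart.so.12
```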
No idea how this happened, but the culprit would be either torch or the mamba install cuda step.
And now we can:
cd (WHERE APEX IS)/apex
# if pip >= 23.1 (ref: https://pip.pypa.io/en/stable/news/#v23-1) which supports multiple `--config-settings` with the same key...
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./
# otherwise
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --global-option="--cpp_ext" --global-option="--cuda_ext" ./
In my case, this did not succeed on the first shot. On one of my systems, the build said cuda_profiler_api.h was missing, so I ran mamba install nvidia/label/cuda-12.1.0::cuda-nvprof and mamba install nvidia/label/cuda-12.1.0::cuda-profiler-api (no idea which one did the job). On another system, however, I did not see this message. Not sure what happened there, but you can probably patch things up as you go, installing the missing pieces one by one.
After compilation, the following should not throw any errors:
import torch
import amp_C # must be after importing torch
Finally, to demonstrate that we didn't accidentally install another systemwide CUDA:
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Feb__7_19:32:13_PST_2023
Cuda compilation tools, release 12.1, V12.1.66
Build cuda_12.1.r12.1/compiler.32415258_0
$ mamba deactivate
$ nvcc --version
(the system tells me to install it with apt; I won't)
I'm not entirely sure if my solution is 100% safe, but hope this helps.