While I don't think this is a bug in the apex code itself, I do think it's a deficiency in the documentation.
Having counted the number of people in the Issues tab, on Stack Overflow, etc. whose problems were due to a CUDA version mismatch, I strongly believe some additional installation instructions would be very helpful. A first glance at the README by a novice (yeah, I'm a CUDA novice, hi, nice to meet you) does not impart the impression that this package is even remotely reliant on matching CUDA versions with Pytorch. (My first reaction was "Uh, okay, but I thought CUDA was backwards-compatible?")
Describe the Bug
Minimal Steps/Code to Reproduce the Bug
mamba install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
git clone https://github.com/NVIDIA/apex.git
cd apex
# if pip >= 23.1 (ref: https://pip.pypa.io/en/stable/news/#v23-1) which supports multiple `--config-settings` with the same key...
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./
# otherwise
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --global-option="--cpp_ext" --global-option="--cuda_ext" ./
...aaand the installation crashes.
Expected Behavior
You'd read the README and expect this installation to work because, well, there is no mention of a version-specific dependency anywhere. Of course, this might be common sense to the seasoned developer, but isn't the point of having installation instructions to make life easier for everyone, not only the experts?
Suggested Action
Clearly indicate in README.md that the CUDA version needs to match the version used to compile Pytorch.
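For context on why the build crashes: the failure happens when the CUDA version torch was compiled with differs from the one nvcc reports. A minimal sketch of that kind of major/minor comparison (the function name is mine, not apex's):

```python
# Hypothetical sketch of the version comparison that makes a build abort.
def cuda_versions_match(torch_cuda: str, nvcc_cuda: str) -> bool:
    """Compare only major.minor, the granularity a build script cares about."""
    return torch_cuda.split(".")[:2] == nvcc_cuda.split(".")[:2]

print(cuda_versions_match("11.8", "12.1"))   # → False: this mismatch kills the build
print(cuda_versions_match("12.1.66", "12.1"))  # → True: patch versions don't matter
```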
Include some suggestions for installing apex without sudo rights, i.e. without switching out the systemwide CUDA and potentially breaking everything else.
The second item doesn't really have to do with any bug or deficiency, but it'd be handy to have these directions around. (I outline a method below.)
Environment
Below, I explain how I installed apex without installing a systemwide CUDA or pulling any containers. I am working on WSL 2; the distro is Ubuntu 22.04. I will assume you have a conda/mamba venv with Pytorch already installed, since you'd already be using torch if you decided to replace your AdamW with apex's, for instance.
Also, I did not install ninja, but I don't think that matters.
python -m torch.utils.collect_env for your reference:
Bonus - My method of installing apex
Check the CUDA version with which Pytorch was built.
Did you know that you can just install CUDA on conda as well? I only learned of this today; I was somehow under the misconception that only the toolkit was available.
Be sure to match the CUDA version with the version torch was built with.
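To find the version torch was built with, `torch.version.cuda` reports it directly (it is None on CPU-only builds):

```python
import torch

# The CUDA version this torch build was compiled against, e.g. "11.8" or "12.1".
# On a CPU-only build this is None, and apex's CUDA extensions won't build at all.
print(torch.version.cuda)
```

Compare this against `nvcc --version` inside the activated venv; the two should agree.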
Navigate to where the libraries are for your venv. In my case, this was /home/stetstet/mambaforge/envs/apex/lib. Make sure that none of the relevant libraries have broken soft-links. In my case, libcudart.so had a broken link, which caused compilation to fail halfway through (ugh). Easily fixed by re-pointing the link at the versioned library file.
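To illustrate the broken-link failure mode and the fix in a self-contained way (using a temp directory rather than the real venv, and a made-up versioned filename libcudart.so.12; check `ls -l` for the actual name on your system):

```python
import os
import tempfile

libdir = tempfile.mkdtemp()                     # stands in for the venv's lib/
real = os.path.join(libdir, "libcudart.so.12")  # the versioned library file
link = os.path.join(libdir, "libcudart.so")     # the soft-link the build resolves
open(real, "w").close()

os.symlink("/nonexistent", link)   # a broken link, like the one I hit
assert not os.path.exists(link)    # os.path.exists follows links: it's broken

os.remove(link)                    # the fix: remove and re-point the link
os.symlink("libcudart.so.12", link)
assert os.path.exists(link)        # now resolves to the real file
print(os.readlink(link))           # → libcudart.so.12
```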
No idea how this happened, but the culprit would be either torch or the mamba install cuda step.
And now we can:
cd (WHERE APEX IS)/apex
# if pip >= 23.1 (ref: https://pip.pypa.io/en/stable/news/#v23-1) which supports multiple `--config-settings` with the same key...
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./
# otherwise
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --global-option="--cpp_ext" --global-option="--cuda_ext" ./
In my case, this did not succeed on the first shot. On one of my systems, the build said cuda_profiler_api.h was missing, so I ran mamba install nvidia/label/cuda-12.1.0::cuda-nvprof and mamba install nvidia/label/cuda-12.1.0::cuda-profiler-api (no idea which one did the job). On another system, however, I did not see this message. Not sure what happened there, but you can probably patch things up as you go, installing the missing pieces one by one.
After compilation, the following should not throw any errors:
import torch
import amp_C # must be after importing torch
Finally, to demonstrate that we didn't accidentally install another systemwide CUDA:
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Feb__7_19:32:13_PST_2023
Cuda compilation tools, release 12.1, V12.1.66
Build cuda_12.1.r12.1/compiler.32415258_0
$ mamba deactivate
$ nvcc --version
(the system tells me to install it with apt; I won't)
I'm not entirely sure if my solution is 100% safe, but hope this helps.