Nerogar / OneTrainer

OneTrainer is a one-stop solution for all your stable diffusion training needs.
GNU Affero General Public License v3.0

[Docs]: Documenting CUDA Toolkit installation on Linux #490

Open Arcitec opened 4 weeks ago

Arcitec commented 4 weeks ago

I'm not sure whether we should put this information on the wiki, in README, or anywhere else.

It can be difficult to get the required CUDA Toolkit version on Linux, since distributions commonly ship only the latest version (12 at the moment).

Users will see errors such as this (included here so that people can find this thread via search):

Could not load library libcudnn_cnn_infer.so.8. Error: libnvrtc.so: cannot open shared object file: No such file or directory

Or this (usually when running natively via venv):

venv/lib/python3.10/site-packages/torch/nn/modules/conv.py:456: UserWarning: Applied workaround for CuDNN issue, install nvrtc.so (Triggered internally at ../aten/src/ATen/native/cudnn/Conv_v8.cpp:84.)
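To check whether a given machine is affected before starting a training run, a small standard-library script can ask the dynamic loader for the library named in these messages. This is only a minimal sketch: the helper names `can_load_library` and `nvrtc_status` are made up for illustration, and a library that loads here could still be a different version than the one PyTorch expects.

```python
import ctypes
import ctypes.util


def can_load_library(name: str) -> bool:
    """Return True if the dynamic loader can open the shared library `name`."""
    try:
        ctypes.CDLL(name)
        return True
    except OSError:
        # Same failure class as "cannot open shared object file".
        return False


def nvrtc_status() -> dict:
    """Collect what the loader knows about NVRTC (hypothetical helper)."""
    return {
        # The exact file name that appears in the error message above.
        "libnvrtc.so loadable": can_load_library("libnvrtc.so"),
        # Name of any nvrtc library the system search finds, or None.
        "find_library('nvrtc')": ctypes.util.find_library("nvrtc"),
    }


if __name__ == "__main__":
    for key, value in nvrtc_status().items():
        print(f"{key}: {value}")
```

Run it with the same Python interpreter (venv or Conda environment) that OneTrainer uses, so that the library search path matches.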

PyTorch tries to work around such issues by using its own CUDA Toolkit libraries, which requirements.txt installs as Python packages from PyPI. That is suboptimal, because some files are still missing, as seen in the example above. (Edit: I have now tested the newest PyTorch for CUDA 12.4, and it no longer has that issue.)
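To see which CUDA version the installed PyTorch build targets, something like the following can be run inside the OneTrainer environment. It is a rough sketch, not an official diagnostic: `torch_cuda_report` is a hypothetical helper name, and the function simply reports that PyTorch is unavailable if the import fails.

```python
import importlib


def torch_cuda_report() -> dict:
    """Summarize the CUDA setup of the installed PyTorch build, if any."""
    try:
        torch = importlib.import_module("torch")
    except (ImportError, OSError):
        # PyTorch is not installed, or its own shared libraries failed to load.
        return {"torch_installed": False}
    return {
        "torch_installed": True,
        "torch_version": torch.__version__,
        # CUDA version the wheel was built for (None for CPU-only builds).
        "built_for_cuda": torch.version.cuda,
        # Whether a usable GPU and driver were detected at runtime.
        "cuda_available": torch.cuda.is_available(),
    }


if __name__ == "__main__":
    for key, value in torch_cuda_report().items():
        print(f"{key}: {value}")
```

If `built_for_cuda` shows an 11.x version while the OS only ships CUDA 12, that matches the situation described in this issue.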


Manually installing old CUDA Toolkits side-by-side in the OS itself is possible via NVIDIA's special installers, but can lead to issues due to older libraries overwriting some newer ones. So I don't recommend installing old CUDA Toolkits in the host OS.

The easiest solution for users is to have Conda on their system, run ./install.sh so that the OneTrainer Conda environment is created, and then execute this command inside the OneTrainer directory to install CUDA Toolkit 11.8 directly into the conda_env directory:

conda install -y --prefix "conda_env" --channel "nvidia/label/cuda-11.8.0" cuda-toolkit

Conda makes it easy to provide the necessary CUDA Toolkit versions without needing to have anything on the host OS.
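For anyone scripting this, here is a minimal Python sketch that builds the same command for an arbitrary toolkit label and then looks for the NVRTC libraries inside the environment. Both function names are hypothetical, and the `lib/` location is an assumption about where the Conda packages place the shared libraries, so verify it on your own system.

```python
from pathlib import Path


def conda_cuda_install_command(cuda_version: str = "11.8.0",
                               prefix: str = "conda_env") -> list[str]:
    """Build the conda command from this post for a given toolkit label."""
    return [
        "conda", "install", "-y",
        "--prefix", prefix,
        "--channel", f"nvidia/label/cuda-{cuda_version}",
        "cuda-toolkit",
    ]


def find_nvrtc_libraries(prefix: str = "conda_env") -> list[Path]:
    """List libnvrtc files under <prefix>/lib (assumed install location)."""
    return sorted(Path(prefix).glob("lib/libnvrtc*.so*"))


if __name__ == "__main__":
    # Print the command instead of running it, so nothing is installed by accident.
    print(" ".join(conda_cuda_install_command()))
    found = find_nvrtc_libraries()
    if found:
        for path in found:
            print(f"found: {path}")
    else:
        print("no libnvrtc files found under conda_env/lib")
```

The command is only printed, not executed. To run it from Python, the list could be passed to `subprocess.run(..., check=True)` from inside the OneTrainer directory.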

CUDA 11.8 is currently the newest 11.x version. The best way to see which other toolkits are available is to look at this list:

https://anaconda.org/nvidia/cuda/labels

Arcitec commented 2 weeks ago

This will probably be fixed for Conda users (and maybe even venv users) when OneTrainer moves to PyTorch 2.5, since newer versions of PyTorch no longer have the missing-library issue described above. I'll keep an eye on that and check again when the new branch is merged:

https://github.com/Nerogar/OneTrainer/tree/torch_2_5