Arcitec opened this issue 4 weeks ago
This will probably be fixed for Conda users (and maybe even venv users) when OneTrainer moves to PyTorch 2.5, since newer versions of PyTorch have fixed that bug. I'll keep an eye on that and check again when the new branch is merged.
I'm not sure whether we should put this information on the wiki, in the README, or somewhere else.
It can be pretty difficult to get the required CUDA Toolkit version on Linux, since distributions commonly ship only the latest version (12 at the moment).
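As a quick way to see what your host actually provides (these are standard NVIDIA tools, not anything OneTrainer-specific):

```sh
# Shows the version of the CUDA Toolkit compiler installed on the host, if any.
nvcc --version

# Note: nvidia-smi reports the highest CUDA version the *driver* supports,
# not the installed toolkit version, so the two numbers can differ.
nvidia-smi
```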
Users will see errors such as this (quoted here so people can find this thread via search):
Or this (usually when running natively via Venv):
PyTorch tries to work around this by shipping its own CUDA libraries, which requirements.txt installs via PyPI packages, but that's a bit suboptimal since some files are still missing, as seen in the example above. (Edit: I have now tested the newest PyTorch build for CUDA 12.4, and it no longer has that issue.)
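As a sanity check (a generic PyTorch check, not something from this project), you can confirm which CUDA version your installed PyTorch wheel was built against and whether it can actually reach the GPU:

```sh
# Prints the CUDA version the installed PyTorch wheel was built against,
# plus whether PyTorch can initialize the GPU at all.
python -c "import torch; print(torch.version.cuda, torch.cuda.is_available())"
```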
Manually installing old CUDA Toolkits side-by-side in the OS itself is possible via NVIDIA's special installers, but it can lead to issues because older libraries may overwrite newer ones. So I don't recommend installing old CUDA Toolkits in the host OS.
The easiest solution for users is to have Conda on their system, run `./install.sh` so that the OneTrainer Conda environment is created, and then execute this command inside the OneTrainer directory to install CUDA Toolkit 11.8 directly into the `conda_env` directory:
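A rough sketch of such a command, assuming NVIDIA's `nvidia/label/cuda-11.8.0` channel label and OneTrainer's default `conda_env` prefix (verify against the channel list below before running):

```sh
# Sketch: install the CUDA Toolkit 11.8 packages from NVIDIA's Conda channel
# into OneTrainer's local conda_env prefix instead of the host OS.
# The channel label and package name are assumptions; check the label list below.
conda install -p conda_env -c "nvidia/label/cuda-11.8.0" cuda-toolkit
```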
Conda makes it easy to provide the necessary CUDA Toolkit versions without needing to have anything on the host OS.
That's currently the newest 11.x version. The best way to see which other toolkit versions are available is to look at this list:
https://anaconda.org/nvidia/cuda/labels
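You can also query the channel from the command line if you prefer (a generic conda search, not OneTrainer-specific):

```sh
# Lists the cuda-toolkit package versions published on NVIDIA's Conda channel.
conda search -c nvidia cuda-toolkit
```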