renatomserra closed this 2 months ago
ImportError: libnccl.so.2: cannot open shared object file: No such file or directory
Sounds like nvidia drivers are not installed correctly.
Hmm, strange. I'm following the guide like I have before, and it stopped working 🤔
Sometimes something as simple as an apt update can bork nvidia drivers. What does nvidia-smi show?
Tried apt update, no change
nvidia-smi:
Hmm, ok.
https://docs.nvidia.com/deeplearning/nccl/install-guide/index.html
Go there and try following the NCCL installation instructions.
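Before and after reinstalling NCCL, it may help to check whether the dynamic loader can actually open the library named in the error. A minimal sketch, assuming Python 3 on the instance; `can_load` is a hypothetical helper name, not part of SimpleTuner or NCCL:

```python
import ctypes


def can_load(lib_name: str) -> bool:
    """Return True if the dynamic loader can open lib_name, else False."""
    try:
        ctypes.CDLL(lib_name)
        return True
    except OSError:
        # Same failure class as "cannot open shared object file".
        return False


if __name__ == "__main__":
    # libnccl.so.2 is the library named in the ImportError above.
    name = "libnccl.so.2"
    status = "found" if can_load(name) else "NOT found by the loader"
    print(f"{name}: {status}")
```

A pip-installed PyTorch may carry its own copy of NCCL outside the default search path, so treat a miss here as a hint rather than proof.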
Will give this a try
Are you still able to run SimpleTuner on Vast.ai instances using the same Docker image from the docs?
I had been as of a week ago
Yeah it was working for me until 2 days ago.
@bghira says to try the pytorch/pytorch_2.4.0-cuda12.4-cudnn9-devel image. If that helps I will update the guide.
I started that one up fresh on a 3090, 4090, A100 and H100 to test, and they all worked well. The problem is that the default image selected by some vendors like Vast has CUDA 11.8 or 11.5 in it (yikes), and PyTorch 2.6 no longer supports these.
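To see which CUDA toolkit a vendor's default image ships, one option is to parse the release number from `nvcc --version`. This is only a sketch: `parse_cuda_release` and the `ASSUMED_MINIMUM` threshold are hypothetical, with the threshold simply taken from the CUDA 12.4 image that worked in this thread, and it assumes `nvcc` is present (as it usually is in devel images).

```python
import re
import subprocess
from typing import Optional, Tuple

# Assumption: 12.4 is used only because that image worked in this thread.
ASSUMED_MINIMUM = (12, 4)


def parse_cuda_release(nvcc_output: str) -> Optional[Tuple[int, int]]:
    """Extract (major, minor) from the 'release X.Y' text in nvcc output."""
    match = re.search(r"release\s+(\d+)\.(\d+)", nvcc_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


if __name__ == "__main__":
    try:
        result = subprocess.run(
            ["nvcc", "--version"], capture_output=True, text=True, check=False
        )
        release = parse_cuda_release(result.stdout)
    except FileNotFoundError:
        # nvcc is not on PATH, so the toolkit version cannot be read this way.
        release = None

    if release is None:
        print("Could not determine the CUDA toolkit release from nvcc.")
    elif release < ASSUMED_MINIMUM:
        print(f"CUDA {release[0]}.{release[1]} is older than the assumed minimum.")
    else:
        print(f"CUDA {release[0]}.{release[1]} meets the assumed minimum.")
```

Note that nvidia-smi reports the highest CUDA version the driver supports, not the toolkit inside the image, which is why this sketch reads nvcc instead.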
Yep, just tested with that image and it does work. Thanks a lot, guys!
Hello, I started getting this error with the same container. Any ideas?
my config: