Thresher12 closed this issue 10 months ago
This is caused by the CUDA driver library not being found. I will implement a workaround soon that should fix this. For now, the best option is to find libcuda.so on your system and make it visible to the Slurm jobs.
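For anyone who wants to try the manual route in the meantime, here is a rough sketch of one way to do it, not an official bitsandbytes tool. It walks a few directories looking for libcuda.so* and prints an export line you could paste into your job script before the training command. The search roots, the function names, and the choice to skip "stubs" directories (the CUDA toolkit usually ships a link-time stub there rather than the real driver library) are all my assumptions, so adjust them for your cluster.

```python
import os

# Assumed search locations (hypothetical defaults); edit for your cluster.
DEFAULT_ROOTS = ["/usr/lib64", "/usr/lib/x86_64-linux-gnu", "/usr/local/cuda"]


def find_libcuda_dirs(roots=DEFAULT_ROOTS):
    """Return directories under `roots` that contain a libcuda.so* file."""
    found = []
    for root in roots:
        # os.walk silently yields nothing if `root` does not exist.
        for dirpath, _dirnames, filenames in os.walk(root):
            # Assumption: "stubs" directories hold link-time stubs, not the driver.
            if os.path.basename(dirpath) == "stubs":
                continue
            if any(name.startswith("libcuda.so") for name in filenames):
                if dirpath not in found:
                    found.append(dirpath)
    return found


def prepend_to_ld_library_path(dirs, current=""):
    """Put `dirs` in front of an existing LD_LIBRARY_PATH string, without duplicates."""
    rest = [p for p in current.split(":") if p and p not in dirs]
    return ":".join(list(dirs) + rest)


if __name__ == "__main__":
    dirs = find_libcuda_dirs()
    if dirs:
        new_path = prepend_to_ld_library_path(dirs, os.environ.get("LD_LIBRARY_PATH", ""))
        print("export LD_LIBRARY_PATH=" + new_path)
    else:
        print("libcuda.so not found under the searched roots")
```

Run it on a GPU compute node rather than the login node, since the driver library is often only installed where the GPUs are.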
The same error can also appear when the same Docker image is shared between GPU and CPU models.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
I'm attempting to run a training job from a program that uses bitsandbytes on a remote SLURM computer cluster. From the errors, it looks like an issue with CUDA. I've tried to look up solutions, but I couldn't find a straightforward one.
Here's the error when attempting to start training:
Here is the report from python -m bitsandbytes:
And here is my job submission header, in case it's relevant: