coreylowman / cudarc

Safe rust wrapper around CUDA toolkit
Apache License 2.0

Unable to find cuda lib under the names ["cuda", "nvcuda"] on WSL #255

Closed EricLBuehler closed 3 months ago

EricLBuehler commented 3 months ago

Hello @coreylowman!

I was running mistral.rs on a WSL machine when I ran into this error:

thread 'main' panicked at /home/mbuehler/.cargo/registry/src/index.crates.io-6f17d22bba15001f/cudarc-0.11.5/src/driver/sys/mod.rs:61:9:
Unable to find cuda lib under the names ["cuda", "nvcuda"]. Please open GitHub issue.

I have CUDA 12.2 installed, but my LD_LIBRARY_PATH does not point to a directory where libcuda.so can be found. When I manually set it to point there, everything works. I could not find an existing report of this problem, so I am wondering if it only affects my machine. Regardless, it would be helpful if the error message suggested checking LD_LIBRARY_PATH.
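The manual workaround described above can be sketched roughly as follows. This is a minimal sketch, not cudarc's own lookup logic: the helper name `find_libcuda_dir` is hypothetical, and the candidate directories are assumptions (on WSL2 the driver library is normally exposed under /usr/lib/wsl/lib, but your system may differ).

```shell
# Hypothetical helper: print the first directory among the arguments that
# contains a file matching libcuda.so*, or return non-zero if none does.
find_libcuda_dir() {
    for dir in "$@"; do
        for f in "$dir"/libcuda.so*; do
            # If the glob matched nothing, "$f" is the literal pattern and -e fails.
            if [ -e "$f" ]; then
                printf '%s\n' "$dir"
                return 0
            fi
        done
    done
    return 1
}

# Candidate directories are assumptions; adjust them for your installation.
if cuda_dir=$(find_libcuda_dir /usr/lib/wsl/lib /usr/lib/x86_64-linux-gnu /usr/local/cuda/lib64); then
    # Prepend the directory, keeping any existing LD_LIBRARY_PATH entries.
    export LD_LIBRARY_PATH="$cuda_dir${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
    echo "libcuda found in $cuda_dir"
else
    echo "libcuda.so not found in the candidate directories" >&2
fi
```

Run this in the same shell session that launches the program (or source it from your shell profile), since the exported variable only affects that shell and its child processes.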

coreylowman commented 3 months ago

Yeah, candle has encountered issues with this as well (linking to #219 and https://github.com/huggingface/candle/issues/2175).

I will update the error message tomorrow.

maulberto3 commented 4 weeks ago

Hello @EricLBuehler, I have experienced the exact same problem. Can you show me how to make it work?

maulberto3 commented 3 weeks ago

Hi @coreylowman, the issue still persists in 0.12.1. The odd part is that cudarc always links CUDA dynamically, even with default-features=false and without "dynamic-linking" in the features list: the error is always Unable to dynamically load the "cuda" shared library - searched for library names: ["cuda", "nvcuda"]...

So I can't disable dynamic linking at all in the hope of making it work with runtime env vars.

Lastly, if dynamic linking is forced anyway, maybe we should include other root paths, as my WSL installation is at /usr/local/cuda-12.6/, which is not covered at all here

Oh, symbolic links didn't work either, so I don't know what else to try in WSL.

ChaseLewis commented 4 days ago

Is there a way around this issue? My libcuda.so is located at /usr/local/cuda-12.4/lib64/stubs/libcuda.so, and there is a symlink at /usr/local/cuda/lib64/stubs/libcuda.so. I've tried all sorts of variations of the LD_LIBRARY_PATH env variable and none seem to work. I currently have my env variables set like this:

CUDA_HOME="/usr/local/cuda-12.4"
CUDA_MAJOR_VERSION="12"
CUDA_MINOR_VERSION="4"
CUDA_PATH=$CUDA_HOME
CUDA_ROOT=$CUDA_HOME
CUDA_TOOLKIT_ROOT_DIR=$CUDA_HOME
LD_LIBRARY_PATH="$CUDA_HOME/lib64:$CUDA_HOME/lib64/stubs"
PATH="$PATH:$CUDA_HOME/bin"
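
One thing that may be worth checking with a configuration like the one above: the libcuda.so shipped under the toolkit's lib64/stubs directory is a link-time stub, not the real driver library, so loading it at runtime is unlikely to give a working driver. The sketch below lists the search-path entries that end in /stubs. The helper name `list_stub_dirs` is hypothetical, and whether this is the cause of the error in this thread is an assumption, not something the thread confirms.

```shell
# Hypothetical helper: print every entry of a colon-separated search path
# that ends in "/stubs". The body runs in a subshell, so the IFS and
# globbing changes do not leak into the calling shell.
list_stub_dirs() (
    IFS=:
    set -f   # disable globbing while the path is split on ':'
    for entry in $1; do
        case "$entry" in
            */stubs | */stubs/) printf '%s\n' "$entry" ;;
        esac
    done
)

# Any directory printed here is a candidate for removal from LD_LIBRARY_PATH.
list_stub_dirs "${LD_LIBRARY_PATH:-}"
```

If a stubs directory shows up, one option to try is dropping it from LD_LIBRARY_PATH and pointing at the directory that holds the real driver library instead (on WSL2 this is normally /usr/lib/wsl/lib).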

maulberto3 commented 3 days ago

I see two possible solutions, one less diligent than the other: 1) Make a pull request to the team, explaining and fixing the issue (TBH I should have done this already, but haven't...); 2) Clone the repo yourself, edit source code and use that instead.

maulberto3 commented 3 hours ago

@ChaseLewis I went ahead and tried the 2nd option, still no luck. I found out that it should work, because the CUDA toolkit installation symlinks /usr/local/cuda/ to whichever CUDA version you installed last. I did a fresh cleanup and a from-scratch installation, and now I get Error: DriverError(CUDA_ERROR_NO_DEVICE, "no CUDA-capable device is detected"), which is clearly wrong. With the same env vars, my simple (separate) CUDA code works well, meaning the installation and hardware are good, but my simple cudarc code does not (I am trying to isolate and troubleshoot cudarc in WSL2, separately from candle).

At least the Unable to dynamically load (...) error stopped. However, considering the new error, I don't know if I'm better or worse off...

I hope to continue troubleshooting cudarc in WSL2.
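
For anyone continuing this troubleshooting, a few basic WSL2 checks can be sketched as below. This is only a sketch under assumptions: /dev/dxg is the GPU device node WSL2 exposes, /usr/lib/wsl/lib/libcuda.so.1 is where the driver library is normally mounted, and `report_path` is a hypothetical helper. None of these checks are specific to cudarc, and passing them does not guarantee that cudarc will find a device.

```shell
# Hypothetical helper: report whether a path exists.
report_path() {
    if [ -e "$1" ]; then
        echo "ok: $1"
    else
        echo "missing: $1"
        return 1
    fi
}

# GPU device node exposed inside WSL2 (assumed location).
report_path /dev/dxg || true
# Driver library provided by the Windows host driver (assumed location).
report_path /usr/lib/wsl/lib/libcuda.so.1 || true

# List the GPUs the driver can see, if nvidia-smi is available.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi -L || true
else
    echo "nvidia-smi not on PATH"
fi
```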