coreylowman / cudarc

Safe rust wrapper around CUDA toolkit
Apache License 2.0
566 stars, 72 forks

cudarc fails to load libraries on official nvidia ubuntu images #274

Closed: manifest closed this issue 3 weeks ago

manifest commented 1 month ago

Docker image: nvidia/cuda:12.5.1-runtime-ubuntu24.04
cudarc version: 0.11.7

Error message:

Unable to dynamically load the "cublas" shared library - searched for library names: ["cublas", "cublas64", "cublas64_12", "cublas64_125", "cublas64_125_0", "cublas64_120_5", "cublas64_10", "cublas64_120_0", "cublas64_9"]. Ensure that `LD_LIBRARY_PATH` has the correct path to the installed library. If the shared library is present on the system under a different name than one of those listed above, please open a GitHub issue.

Location of the libraries on nvidia/cuda:12.5.1-runtime-ubuntu24.04:

/usr/local/cuda/targets/x86_64-linux/lib/libOpenCL.so.1
/usr/local/cuda/targets/x86_64-linux/lib/libcublas.so.12
/usr/local/cuda/targets/x86_64-linux/lib/libcublasLt.so.12
/usr/local/cuda/targets/x86_64-linux/lib/libcudart.so.12
/usr/local/cuda/targets/x86_64-linux/lib/libcufft.so.11
/usr/local/cuda/targets/x86_64-linux/lib/libcufftw.so.11
/usr/local/cuda/targets/x86_64-linux/lib/libcufile.so.0
/usr/local/cuda/targets/x86_64-linux/lib/libcufile_rdma.so.1
/usr/local/cuda/targets/x86_64-linux/lib/libcurand.so.10
/usr/local/cuda/targets/x86_64-linux/lib/libcusolver.so.11
/usr/local/cuda/targets/x86_64-linux/lib/libcusolverMg.so.11
/usr/local/cuda/targets/x86_64-linux/lib/libcusparse.so.12
/usr/local/cuda/targets/x86_64-linux/lib/libnppc.so.12
/usr/local/cuda/targets/x86_64-linux/lib/libnppial.so.12
/usr/local/cuda/targets/x86_64-linux/lib/libnppicc.so.12
/usr/local/cuda/targets/x86_64-linux/lib/libnppidei.so.12
/usr/local/cuda/targets/x86_64-linux/lib/libnppif.so.12
/usr/local/cuda/targets/x86_64-linux/lib/libnppig.so.12
/usr/local/cuda/targets/x86_64-linux/lib/libnppim.so.12
/usr/local/cuda/targets/x86_64-linux/lib/libnppist.so.12
/usr/local/cuda/targets/x86_64-linux/lib/libnppisu.so.12
/usr/local/cuda/targets/x86_64-linux/lib/libnppitc.so.12
/usr/local/cuda/targets/x86_64-linux/lib/libnpps.so.12
/usr/local/cuda/targets/x86_64-linux/lib/libnvJitLink.so.12
/usr/local/cuda/targets/x86_64-linux/lib/libnvblas.so.12
/usr/local/cuda/targets/x86_64-linux/lib/libnvfatbin.so.12
/usr/local/cuda/targets/x86_64-linux/lib/libnvjpeg.so.12
/usr/local/cuda/targets/x86_64-linux/lib/libnvrtc-builtins.so.12.5
/usr/local/cuda/targets/x86_64-linux/lib/libnvrtc.so.12

coreylowman commented 1 month ago

Hmm, I've always used the CUDA devel Docker images (e.g. 12.5.1-cudnn-devel-ubuntu20.04), and those have worked for me.

Can you try the devel images? If you need the runtime images, I can look into why they differ (I suspect the .12 at the end of the library name is tripping up the dynamic loader's library search).

Alternatively, you can disable dynamic loading in favor of dynamic linking, which will likely work.
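
To illustrate the hypothesis above: on Linux, a bare library name such as "cublas" is conventionally expanded to the file name libcublas.so, while the runtime image listed earlier ships only libcublas.so.12. The Rust sketch below assumes that expansion. `platform_filename` and `is_versioned_variant` are hypothetical helpers written for this illustration, not cudarc functions, and cudarc's actual search logic may differ.

```rust
use std::env::consts::{DLL_PREFIX, DLL_SUFFIX};

/// Expand a bare library name into the platform's conventional file name.
/// On Linux, "cublas" becomes "libcublas.so".
fn platform_filename(name: &str) -> String {
    format!("{}{}{}", DLL_PREFIX, name, DLL_SUFFIX)
}

/// Hypothetical check: is `installed` the same file name as `candidate`
/// followed by a numeric version suffix such as ".12" or ".12.5"?
fn is_versioned_variant(candidate: &str, installed: &str) -> bool {
    let Some(rest) = installed.strip_prefix(candidate) else {
        return false;
    };
    let Some(version) = rest.strip_prefix('.') else {
        return false;
    };
    // Every dot-separated part must be a non-empty run of ASCII digits.
    version
        .split('.')
        .all(|part| !part.is_empty() && part.chars().all(|c| c.is_ascii_digit()))
}
```

Under these assumptions, the candidate for "cublas" and the installed libcublas.so.12 are different file names, so an exact-name lookup misses the library even though a versioned copy is present.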

manifest commented 1 month ago

> Can you try the devel images? If you need the runtime images, I can look into why they differ (I suspect the .12 at the end of the library name is tripping up the dynamic loader's library search).

That could be the reason, because creating symlinks for the libraries above resolves the issue on the runtime image. That workaround works, but I would love to get rid of it :-)

We use the devel image for the first build stage and then move the binary to the runtime image to keep the image size small.

> Hmm, I've always used the CUDA devel Docker images (e.g. 12.5.1-cudnn-devel-ubuntu20.04), and those have worked for me.

We used nvidia/cuda:11.8.0-runtime-ubuntu22.04 with cudarc 0.10.0 and it worked fine. After upgrading to cudarc 0.11.7, we ran into this problem.

I would have tested cudarc 0.11.7 against cuda:11, but some other dependency in our application now requires cuda:12.
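
The symlink workaround mentioned in this comment can be sketched in Rust as follows. The thread does not show the exact commands that were used, so this is only an assumption about its shape: for each versioned file such as libcublas.so.12, create an unversioned libcublas.so link beside it. `link_unversioned` is a hypothetical helper name, and the code is Unix-only.

```rust
use std::fs;
use std::io;
use std::os::unix::fs::symlink;
use std::path::{Path, PathBuf};

/// For every `<name>.so.<version>` entry in `dir`, create a `<name>.so`
/// symlink next to it unless something with that name already exists.
/// Returns the sorted list of links that were created.
fn link_unversioned(dir: &Path) -> io::Result<Vec<PathBuf>> {
    let mut created = Vec::new();
    for entry in fs::read_dir(dir)? {
        let file_name = entry?.file_name();
        // Skip names that are not valid UTF-8.
        let Some(name) = file_name.to_str() else { continue };
        // Only versioned shared objects, e.g. "libcublas.so.12".
        let Some(idx) = name.find(".so.") else { continue };
        // Keep everything up to and including ".so".
        let link = dir.join(&name[..idx + 3]);
        // Leave existing files or links untouched. If several versions
        // share one base name, whichever is visited first gets the link.
        if link.symlink_metadata().is_ok() {
            continue;
        }
        // Relative target, so the link stays valid if `dir` is moved.
        symlink(name, &link)?;
        created.push(link);
    }
    created.sort();
    Ok(created)
}
```

Pointed at the library directory from the listing above, this would add libcublas.so next to libcublas.so.12, and likewise for the other versioned libraries. Writing there typically requires root, for example in an image build step.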

manifest commented 1 month ago

> Can you try the devel images?

I've just tried nvidia/cuda:12.5.1-devel-ubuntu24.04, it works fine.

> Alternatively, you can disable dynamic loading in favor of dynamic linking, which will likely work.

How can I do that? We use cudarc as a dependency of candle.

coreylowman commented 1 month ago

Hmm, it looks like the main branch of candle is already using dynamic linking. Are y'all on an older version or a branch?

Also, FYI, there was a bug in 0.11.7, so I recommend either upgrading to 0.11.8 or downgrading to 0.11.6 (which is the version candle is targeting).

I'll play around and see if I can get the dynamic loader to account for suffixes on the library file name. I'm not sure we have that much control over prefixes and suffixes though (e.g. adding a .<major> after the .so), so symlinks may have to suffice for now.

coreylowman commented 1 month ago

BTW, I don't see the driver library libcuda.so; it seems the Docker image only has the runtime library (libcudart.so). At minimum, I think this would be blocked until #262 is merged, and candle would then have to switch over to the runtime API feature, which might take a bit longer.

Did y'all see any errors about cuda not being found? I would expect cudarc to fail on that before cublas, so I'm wondering if y'all already included it somehow.

manifest commented 1 month ago

I've upgraded to 0.11.8.

In the logs, I see only the following message, the same as before:

thread 'main' panicked at /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/cudarc-0.11.8/src/lib.rs:98:5:
Unable to dynamically load the "cublas" shared library - searched for library names: ["cublas", "cublas64", "cublas64_12", "cublas64_125", "cublas64_125_0", "cublas64_120_5", "cublas64_10", "cublas64_120_0", "cublas64_9"]. Ensure that `LD_LIBRARY_PATH` has the correct path to the installed library. If the shared library is present on the system under a different name than one of those listed above, please open a GitHub issue.
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

coreylowman commented 1 month ago

Ah yeah, sorry for the miscommunication: upgrading to 0.11.8 won't fix the error in this issue.

manifest commented 1 month ago

Got it. I just wanted to clarify that I'm on the latest version in case you want me to test something :-)

maulberto3 commented 4 weeks ago

@coreylowman Wondering if LibreCuda might help?

manifest commented 3 weeks ago

I've built candle with cudarc from the master branch. This issue has been resolved by the latest commit. Thanks.

@coreylowman Are you planning on making a release?

manifest commented 3 weeks ago

Fixed