JuliaGPU / CUDA_Runtime_Discovery.jl

Counterpart of CUDA_Runtime_jll, but for discovering a local toolkit.

find_libcudadevrt doesn't work on our cluster installation #12

Open simonbyrne opened 3 years ago

simonbyrne commented 3 years ago

Describe the bug

On our cluster the CUDA installation (both 10.2 and 11.2) places libcudadevrt.a under targets/x86_64-linux/lib/libcudadevrt.a. find_libcudadevrt doesn't search this directory.

Manually editing deps/discovery.jl to include that directory makes this work, and all other libraries are found correctly.

Can we either add that directory to the search path, or add an environment variable that lets us specify the path manually?
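
For reference, the manual edit essentially amounts to also searching the targets/ subdirectory. A minimal standalone sketch of the idea (the function name and directory list here are illustrative, not the actual discovery.jl internals):

# Illustrative sketch: look for libcudadevrt.a under a CUDA toolkit root,
# including the targets/<triple>/lib layout used by some installations.
function find_libcudadevrt_sketch(toolkit_root::AbstractString)
    candidate_dirs = [
        "lib64",
        "lib",
        joinpath("targets", "x86_64-linux", "lib"),  # layout without the lib64 symlink
    ]
    for dir in candidate_dirs
        path = joinpath(toolkit_root, dir, "libcudadevrt.a")
        isfile(path) && return path
    end
    return nothing
end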

cc: @jakebolewski

maleadt commented 3 years ago

libcudadevrt always resides in that directory, but there should be a lib64 -> targets/x86_64-linux/lib/ link. Together with the libcuda.so issue, I have a feeling your CUDA distribution is a little messed up.

That said, I'm not really opposed to adding some additional code here: https://github.com/JuliaGPU/CUDA.jl/blob/631e278b56a6355492b4722382c1bec1b323e8af/deps/discovery.jl#L544-L547 (maybe add a comment about the missing link though).

jakebolewski commented 3 years ago

I do think the issue is with how this cluster's particular CUDA installation is set up. CUDA_HOME in the module environment points to the globally installed copy of the CUDA assets, but on the GPU nodes the shared and static libraries are installed under /usr/lib64 (for the versioned .so) and /usr/local/cuda-11.2/ (for the static library and other supporting libraries).

julia> print(ENV["CUDA_HOME"])
/central/software/CUDA/11.2
shell> /usr/lib64/libcuda
libcuda.so.1             libcuda_wrapper.so        libcuda.so.460.32.03      libcuda.so
libcuda_wrapper.la       libcuda_wrapper.so.0      libcuda_wrapper.so.0.0.0
shell> /usr/local/cuda-11.2/lib64/libcuda
libcudart_static.a   libcudart.so          libcudart.so.11.2.72  libcudart.so.11.0     libcudadevrt.a

I'm not sure of the best way to resolve this particular setup. Maybe we just selectively redirect CUDA_HOME on the GPU nodes, or we could add another JULIA_CUDA_ environment variable that injects extra search paths into toolkit_dirs, for weird cluster setups where the login, CPU, and GPU nodes differ.
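
For illustration, such a variable could be consumed roughly like this (the name JULIA_CUDA_TOOLKIT_DIRS and the helper are hypothetical, just to sketch the idea):

# Hypothetical: let users inject extra toolkit directories via an environment variable.
function extra_toolkit_dirs()
    raw = get(ENV, "JULIA_CUDA_TOOLKIT_DIRS", "")
    isempty(raw) && return String[]
    # Accept a colon-separated list and keep only directories that actually exist.
    return filter(isdir, split(raw, ':'))
end

# The result could then be prepended to whatever toolkit_dirs already discovers.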

maleadt commented 3 years ago

An alternative thought is to remove the local CUDA detection altogether, fully bet on artifacts, and have cluster users provide an Overrides.toml, which should give you the necessary flexibility (albeit at a usability cost). But that requires some additional work on the artifact side (probably including the CUDA version in the platform triple), so a temporary hack with env vars is OK for now.
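
For reference, an artifact override would look roughly like the following (placed in ~/.julia/artifacts/Overrides.toml); the UUID and artifact name below are placeholders and would need to match whatever CUDA artifact package ends up being used:

# rough sketch, placeholder UUID and artifact name
[0123abcd-0123-abcd-0123-0123abcd0123]
CUDA = "/central/software/CUDA/11.2"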