williamfgc closed this issue 1 year ago
As mentioned in the docs, https://cuda.juliagpu.org/stable/installation/overview/#Containers, and tested by CI, https://github.com/JuliaGPU/CUDA.jl/blob/master/.buildkite/pipeline.yml#L254-L280, this is a supported configuration (for now) that just works. On a system without CUDA:
❯ JULIA_DEPOT_PATH=$(mktemp -d) julia +1.9
(@v1.9) pkg> add CUDA
Updating registry at `~/Julia/depot/registries/General.toml`
Resolving package versions...
Installed CUDA ─ v4.0.0
Updating `~/Julia/depot/environments/v1.9/Project.toml`
[052768ef] + CUDA v4.0.0
[0c68f7d7] ~ GPUArrays v8.5.0 `~/Julia/pkg/GPUArrays` ⇒ v8.6.0 `~/Julia/pkg/GPUArrays`
[61eb1bfa] ~ GPUCompiler v0.16.5 `~/Julia/pkg/GPUCompiler` ⇒ v0.17.1 `~/Julia/pkg/GPUCompiler`
[929cbde3] ~ LLVM v4.14.0 `~/Julia/pkg/LLVM` ⇒ v4.15.0 `~/Julia/pkg/LLVM`
Updating `~/Julia/depot/environments/v1.9/Manifest.toml`
[621f4979] + AbstractFFTs v1.2.1
[ab4f0b2a] + BFloat16s v0.4.2
[052768ef] + CUDA v4.0.0
[1af6417a] + CUDA_Runtime_Discovery v0.1.1
[0c68f7d7] ~ GPUArrays v8.5.0 `~/Julia/pkg/GPUArrays` ⇒ v8.6.0 `~/Julia/pkg/GPUArrays`
[46192b85] ~ GPUArraysCore v0.1.2 `../../../pkg/GPUArrays/lib/GPUArraysCore` ⇒ v0.1.3 `../../../pkg/GPUArrays/lib/GPUArraysCore`
[61eb1bfa] ~ GPUCompiler v0.16.5 `~/Julia/pkg/GPUCompiler` ⇒ v0.17.1 `~/Julia/pkg/GPUCompiler`
[929cbde3] ~ LLVM v4.14.0 `~/Julia/pkg/LLVM` ⇒ v4.15.0 `~/Julia/pkg/LLVM`
[74087812] + Random123 v1.6.0
[e6cf234a] + RandomNumbers v1.5.3
⌅ [4ee394cb] + CUDA_Driver_jll v0.2.0+0
⌅ [76a88914] + CUDA_Runtime_jll v0.2.3+2
Info Packages marked with ⌅ have new versions available but compatibility constraints restrict them from upgrading. To see why use `status --outdated -m`
Precompiling environment...
4 dependencies successfully precompiled in 56 seconds. 47 already precompiled.
There seems to be something wrong with your system where libcuda.so
is an invalid library:
/usr/lib64/libcuda.so: undefined symbol: cuDriverGetVersion
@maleadt thanks. I see this more from a regression point of view: CUDA v4.0.0 interacts with those libraries (they are exposed through the shared file system nature of Summit/Crusher, but the NVIDIA toolchain is missing), while CUDA v3.13.1 doesn't. Does that make sense?
Yeah, but we don't support that. The only way to detect the CUDA driver is to look for libcuda.so and call it; if your system purposefully sets up an invalid library, well, that will break things.
What purpose exactly does that library have? Does it have any functions defined?
OK, thanks for the heads up that this case is no longer supported. It's the shared file system between Summit (IBM CPU + NVIDIA A100) and Crusher/Frontier (AMD Epyc CPU / AMD MI250x GPU). My understanding from Discourse is that CUDA.jl functionality is loaded lazily, and interaction with libcuda.so should happen later, not when adding the package?
> My understanding from Discourse is that CUDA.jl functionality is loaded lazily, and interaction with libcuda.so should happen later, not when adding the package?
That's true, but Pkg wants to download artifacts early (if possible) so we want to query the CUDA driver during installation. So everything remains lazy, for now, but we try to do some work early.
> It's the shared file system between Summit (IBM CPU + NVIDIA A100) and Crusher/Frontier (AMD Epyc CPU / AMD MI250x GPU).
I mean, why is there a broken library on the filesystem? What purpose does it serve? If it's reasonable for it to be there we should handle this case, of course.
> That's true, but Pkg wants to download artifacts early (if possible) so we want to query the CUDA driver during installation.
Could Pkg provide an option to avoid this and make a truly lazy loading behavior possible?
> I mean, why is there a broken library on the filesystem?
It's a typical transition, with a few years of overlap between an older HPC system and a newer one with a completely new architecture. Also, it's not unusual for the HPC system to be accompanied by more commodity data analysis clusters of different architectures sharing the same file system (including system libraries, as in this case). Complete isolation is not possible in this environment; that's why the previous fully lazy loading model in v3.13.1 was perfect, given the lack of optional packages in Pkg.
Nonetheless, thanks for all the hard work; I've been running CUDA.jl successfully on Summit nodes (with actual NVIDIA GPUs :) ).
> Also, it's not unusual that the HPC system is accompanied with more commodity data analysis clusters of different architectures sharing the same file system (including system libraries as in this case).
The issue here isn't that there's a libcuda available, but that it's a broken one (as far as I can tell from what you've told me). For example, I just scp'd libcuda.so from my GPU workstation to a random Linux system without a GPU or the NVIDIA kernel module, and I can call cuDriverGetVersion perfectly.
@williamfgc can you post nm -D /usr/lib64/libcuda.so? I echo Tim's question: what is this library and why does it exist? It seems not to be a correct libcuda, so if we can detect that case we can handle it.
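A detection along those lines can be sketched outside Julia as well; here is a minimal Python probe (a hypothetical helper, not CUDA.jl's actual discovery code) that checks whether a candidate libcuda.so both loads and exports the cuDriverGetVersion symbol:

```python
import ctypes

def probe_libcuda(path="libcuda.so"):
    """Classify a candidate CUDA driver library: a genuine libcuda must
    be loadable (dlopen) and must export cuDriverGetVersion."""
    try:
        lib = ctypes.CDLL(path)
    except OSError:
        return "cannot load"
    # ctypes raises AttributeError for symbols the library does not export.
    if not hasattr(lib, "cuDriverGetVersion"):
        return "loads, but does not export cuDriverGetVersion (broken libcuda)"
    return "looks like a valid CUDA driver library"
```

On the system discussed in this thread, such a probe would presumably return the "broken libcuda" verdict, while a system without any libcuda.so returns "cannot load".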
I see your point, thanks @maleadt and @vchuravy. nm -D /usr/lib64/libcuda.so returns nothing. I will dig around and ask the system admins. In the meantime, would checking for the existence of both nvidia-smi and libcuda.so be a sensible workaround before further interaction? This is somewhat mentioned in the docs already. I appreciate your input; it's probably a very niche case. Thoughts?
nvidia-smi isn't always present.
What does file /usr/lib64/libcuda.so tell you? Or ls -la /usr/lib64/libcuda.so?
Querying nvidia-smi is not cheap. We can probably make this path more robust, but really, installing a fake libcuda.so is quite devilish...
> What does file /usr/lib64/libcuda.so tell you? Or ls -la /usr/lib64/libcuda.so?
That something is really wrong in our system: /usr/lib64/libcuda.so -> /usr/lib64/libelf.so. I will file a ticket. Edit: yes, it's devilish.
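The symlink diagnosis above can also be scripted; a small sketch (a hypothetical helper, mirroring the ls -la / file checks from this thread) that reports what a library path actually resolves to:

```python
import os

def describe_lib(path):
    """Report whether a library path is missing, a symlink (and where it
    ultimately points), or a regular file."""
    if not os.path.lexists(path):
        return f"{path}: missing"
    if os.path.islink(path):
        # realpath follows the whole symlink chain to the final target.
        return f"{path} -> {os.path.realpath(path)}"
    return f"{path}: regular file"
```

On the affected system, describe_lib("/usr/lib64/libcuda.so") would report the offending redirection to /usr/lib64/libelf.so.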
For the most part this is superseded by the use of cuda/nvhpc modules (EDIT: which are not available on AMD's Crusher), as default system libraries tend to be very old (the system lifespan is about 7 to 10 years); now I wonder if there is a good reason for the above. Thank you both for your help, looking forward to a roadmap for optional packages in Julia.
FWIW, https://github.com/JuliaPackaging/Yggdrasil/pull/6187 will fix this.
> looking forward to a roadmap for optional packages in Julia.
Weak dependencies are arriving in 1.9!
Thanks @maleadt and @vchuravy, the awesome effort is appreciated! We are giving a tutorial next week to HPC folks and CUDA.jl is a central component. I have to learn about weak dependencies myself!
Describe the bug
Thanks for the great effort. CUDA.jl is expected to be installed on hardware without NVIDIA GPUs (see here) so it can coexist with AMDGPU.jl as proper dependencies in Project.toml; the lazy loading approach is appreciated for portability.

To reproduce
The Minimal Working Example (MWE) for this bug: I'm testing on Crusher, which is a test bed for Frontier (an AMD system).

Version info
julia 1.9.0-beta.
Details on CUDA: CUDA.jl v4.0.0 doesn't work; CUDA.jl#v3.13.1 works as expected.
N/A

Additional context
Happy to help test this, as it's more of a regression.