
Regression on Pkg adding CUDA.jl v4.0.0 on AMD Crusher system #1753

Closed: williamfgc closed this issue 1 year ago

williamfgc commented 1 year ago

Describe the bug

Thanks for the great effort. CUDA.jl is expected to be installable on hardware without NVIDIA GPUs (see here) so it can coexist with AMDGPU.jl as a regular dependency in Project.toml; the lazy-loading approach is appreciated for portability.

To reproduce

The Minimal Working Example (MWE) for this bug (I'm testing on Crusher, which is a test bed for Frontier, an AMD system):

pkg> add CUDA
CUDA [052768ef-5323-5732-b1bb-66c8b64840ba]

Failed to precompile CUDA [052768ef-5323-5732-b1bb-66c8b64840ba] to "/gpfs/alpine/proj-shared/csc383/etc/crusher/julia_depot/compiled/v1.9/CUDA/jl_jUHVb6".
ERROR: LoadError: InitError: could not load symbol "cuDriverGetVersion":
/usr/lib64/libcuda.so: undefined symbol: cuDriverGetVersion
pkg> add CUDA#v3.13.1
 Resolving package versions...
    Updating `/gpfs/alpine/csc383/proj-shared/wgodoy/ADIOS2-Examples/source/julia/GrayScott.jl/Project.toml`
  [052768ef] ~ CUDA v4.0.0 ⇒ v3.13.1 `https://github.com/JuliaGPU/CUDA.jl.git#v3.13.1`
    Updating `/gpfs/alpine/csc383/proj-shared/wgodoy/ADIOS2-Examples/source/julia/GrayScott.jl/Manifest.toml`
⌅ [ab4f0b2a] ↓ BFloat16s v0.4.2 ⇒ v0.2.0
  [052768ef] ~ CUDA v4.0.0 ⇒ v3.13.1 `https://github.com/JuliaGPU/CUDA.jl.git#v3.13.1`
  [1af6417a] - CUDA_Runtime_Discovery v0.1.1
  [4ee394cb] - CUDA_Driver_jll v0.2.0+0
  [76a88914] - CUDA_Runtime_jll v0.2.3+2
        Info Packages marked with ⌅ have new versions available but compatibility constraints restrict them from upgrading. To see why use `status --outdated -m`
Precompiling environment...
  3 dependencies successfully precompiled in 56 seconds. 153 already precompiled.

Version info: julia 1.9.0-beta.

Details on CUDA: CUDA.jl v4.0.0 doesn't work; CUDA.jl#v3.13.1 works as expected.

# please post the output of:
CUDA.versioninfo()

N/A

Additional context

Happy to help test this, as it looks like a regression.

maleadt commented 1 year ago

As mentioned in the docs, https://cuda.juliagpu.org/stable/installation/overview/#Containers, and tested by CI, https://github.com/JuliaGPU/CUDA.jl/blob/master/.buildkite/pipeline.yml#L254-L280, this is a supported configuration (for now) that just works. On a system without CUDA:

❯ JULIA_DEPOT_PATH=$(mktemp -d) julia +1.9
(@v1.9) pkg> add CUDA
    Updating registry at `~/Julia/depot/registries/General.toml`
   Resolving package versions...
   Installed CUDA ─ v4.0.0
    Updating `~/Julia/depot/environments/v1.9/Project.toml`
  [052768ef] + CUDA v4.0.0
  [0c68f7d7] ~ GPUArrays v8.5.0 `~/Julia/pkg/GPUArrays` ⇒ v8.6.0 `~/Julia/pkg/GPUArrays`
  [61eb1bfa] ~ GPUCompiler v0.16.5 `~/Julia/pkg/GPUCompiler` ⇒ v0.17.1 `~/Julia/pkg/GPUCompiler`
  [929cbde3] ~ LLVM v4.14.0 `~/Julia/pkg/LLVM` ⇒ v4.15.0 `~/Julia/pkg/LLVM`
    Updating `~/Julia/depot/environments/v1.9/Manifest.toml`
  [621f4979] + AbstractFFTs v1.2.1
  [ab4f0b2a] + BFloat16s v0.4.2
  [052768ef] + CUDA v4.0.0
  [1af6417a] + CUDA_Runtime_Discovery v0.1.1
  [0c68f7d7] ~ GPUArrays v8.5.0 `~/Julia/pkg/GPUArrays` ⇒ v8.6.0 `~/Julia/pkg/GPUArrays`
  [46192b85] ~ GPUArraysCore v0.1.2 `../../../pkg/GPUArrays/lib/GPUArraysCore` ⇒ v0.1.3 `../../../pkg/GPUArrays/lib/GPUArraysCore`
  [61eb1bfa] ~ GPUCompiler v0.16.5 `~/Julia/pkg/GPUCompiler` ⇒ v0.17.1 `~/Julia/pkg/GPUCompiler`
  [929cbde3] ~ LLVM v4.14.0 `~/Julia/pkg/LLVM` ⇒ v4.15.0 `~/Julia/pkg/LLVM`
  [74087812] + Random123 v1.6.0
  [e6cf234a] + RandomNumbers v1.5.3
⌅ [4ee394cb] + CUDA_Driver_jll v0.2.0+0
⌅ [76a88914] + CUDA_Runtime_jll v0.2.3+2
        Info Packages marked with ⌅ have new versions available but compatibility constraints restrict them from upgrading. To see why use `status --outdated -m`
Precompiling environment...
  4 dependencies successfully precompiled in 56 seconds. 47 already precompiled.

There seems to be something wrong with your system where libcuda.so is an invalid library:

/usr/lib64/libcuda.so: undefined symbol: cuDriverGetVersion
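
You can verify this outside of CUDA.jl with the Libdl standard library; a minimal sketch (using the path from your error message):

using Libdl

# A valid driver library must export cuDriverGetVersion.
lib = Libdl.dlopen("/usr/lib64/libcuda.so")
fptr = Libdl.dlsym(lib, :cuDriverGetVersion; throw_error=false)
fptr === nothing && println("libcuda.so does not export cuDriverGetVersion")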
williamfgc commented 1 year ago

@maleadt thanks. I'm seeing this more from the regression point of view: CUDA v4.0.0 interacts with those libraries (they are exposed because Summit and Crusher share a file system, but the NVIDIA toolchain is missing), while CUDA v3.13.1 doesn't. Does that make sense?

maleadt commented 1 year ago

Yeah, but we don't support that. The only way to detect the CUDA driver is to look for libcuda.so and call it; if your system purposefully sets up an invalid library, well, that will break things.

What purpose exactly does that library have? Does it have any functions defined?

williamfgc commented 1 year ago

OK, thanks for the heads-up that this case is no longer supported. It's the shared file system between Summit (IBM POWER9 CPUs + NVIDIA V100 GPUs) and Crusher/Frontier (AMD EPYC CPUs + AMD MI250X GPUs). My understanding from Discourse is that CUDA.jl functionality is loaded lazily, and the interaction with libcuda.so should happen later, not when adding the package?

maleadt commented 1 year ago

My understanding from Discourse is that CUDA.jl functionality is loaded lazily, and the interaction with libcuda.so should happen later, not when adding the package?

That's true, but Pkg wants to download artifacts early (if possible), so we query the CUDA driver during installation. Everything remains lazy, for now, but we try to do some work eagerly.
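
Conceptually, the early query boils down to something like this (a rough sketch, not CUDA.jl's actual code):

using Libdl

# Locate libcuda, resolve cuDriverGetVersion, and call it;
# return nothing if any step fails.
function query_driver_version()
    lib = Libdl.dlopen("libcuda"; throw_error=false)
    lib === nothing && return nothing     # no driver library at all
    fptr = Libdl.dlsym(lib, :cuDriverGetVersion; throw_error=false)
    fptr === nothing && return nothing    # invalid library, as on Crusher
    version = Ref{Cint}(0)
    # CUresult cuDriverGetVersion(int *version); 0 means CUDA_SUCCESS
    ccall(fptr, Cint, (Ref{Cint},), version) == 0 || return nothing
    # e.g. 11080 becomes v"11.8.0"
    return VersionNumber(version[] ÷ 1000, (version[] % 1000) ÷ 10)
end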

It's the shared file system between Summit (IBM POWER9 CPUs + NVIDIA V100 GPUs) and Crusher/Frontier (AMD EPYC CPUs + AMD MI250X GPUs).

I mean, why is there a broken library on the filesystem? What purpose does it serve? If it's reasonable for it to be there we should handle this case, of course.

williamfgc commented 1 year ago

That's true, but Pkg wants to download artifacts early (if possible), so we query the CUDA driver during installation.

Could Pkg offer an option to avoid this and make a truly lazy loading behavior possible?

I mean, why is there a broken library on the filesystem?

It's a typical transition: a few years of overlap between an older HPC system and a newer one with a completely new architecture. Also, it's not unusual for the HPC system to be accompanied by more commodity data-analysis clusters of different architectures sharing the same file system (including system libraries, as in this case). Complete isolation is not possible in this environment, which is why the fully lazy loading model in v3.13.1 was perfect, given the lack of optional packages in Pkg.

Nonetheless, thanks for all the hard work; I've been running CUDA.jl successfully on Summit nodes (with actual NVIDIA GPUs :) ).

maleadt commented 1 year ago

Also, it's not unusual for the HPC system to be accompanied by more commodity data-analysis clusters of different architectures sharing the same file system (including system libraries, as in this case).

The issue here isn't that there's a libcuda available, but that it's a broken one (as far as I can tell from what you've told me). For example, I just scp'd libcuda.so from my GPU workstation to a random Linux system without a GPU or the NVIDIA kernel module, and I can call cuDriverGetVersion perfectly.

vchuravy commented 1 year ago

@williamfgc can you post the output of nm -D /usr/lib64/libcuda.so? I echo Tim's question: what is this library, and why does it exist? It seems not to be a correct libcuda, so if we can detect that case, we can handle it.

williamfgc commented 1 year ago

I see your point, thanks @maleadt and @vchuravy. nm -D /usr/lib64/libcuda.so returns nothing. I will dig around and ask the system admins. In the meantime, would checking for the existence of both nvidia-smi and libcuda.so be a sensible workaround before any further interaction? Something like this is already mentioned in the docs. I appreciate your input; it's probably a very niche case. Thoughts?
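
Something along these lines, as a hypothetical (untested) sketch:

using Libdl

# Hypothetical workaround: only probe the driver when both an
# nvidia-smi binary and a loadable libcuda are present.
has_nvidia_smi() = Sys.which("nvidia-smi") !== nothing
has_libcuda() = Libdl.dlopen("libcuda"; throw_error=false) !== nothing
probe_cuda() = has_nvidia_smi() && has_libcuda()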

maleadt commented 1 year ago

nvidia-smi isn't always present.

What does file /usr/lib64/libcuda.so tell you? Or ls -la /usr/lib64/libcuda.so?

vchuravy commented 1 year ago

Querying nvidia-smi is not cheap. We can probably make this path more robust, but really, installing a fake libcuda.so is quite devilish...

williamfgc commented 1 year ago

What does file /usr/lib64/libcuda.so tell you? Or ls -la /usr/lib64/libcuda.so?

That something is really wrong: /usr/lib64/libcuda.so -> /usr/lib64/libelf.so on our system. I will file a ticket. EDIT: yes, it's devilish.

williamfgc commented 1 year ago

For the most part this is superseded by the use of cuda/nvhpc modules (EDIT: which are not available on AMD's Crusher), as default systems tend to be very old (their lifespan is about 7 to 10 years). Now I wonder if there is a good reason for the above. Thank you both for your help; looking forward to a roadmap for optional packages in Julia.

maleadt commented 1 year ago

FWIW, https://github.com/JuliaPackaging/Yggdrasil/pull/6187 will fix this.

looking forward to a roadmap for optional packages in Julia.

Weak dependencies are arriving in 1.9!
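
For reference, a package opting into this declares CUDA.jl in its Project.toml roughly like so (MyPackageCUDAExt and the surrounding package are placeholders):

[weakdeps]
CUDA = "052768ef-5323-5732-b1bb-66c8b64840ba"

[extensions]
# the extension module is loaded only when the user also loads CUDA.jl
MyPackageCUDAExt = "CUDA"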

williamfgc commented 1 year ago

Thanks @maleadt and @vchuravy, I appreciate the awesome effort! We are giving a tutorial to HPC folks next week, and CUDA.jl is a central component. I'll have to teach myself about weak dependencies!