Buggy precompilation of init-defined symbols can break CUDA_Driver_jll initialization #1798

simonbyrne commented 1 year ago

Describe the bug

I'm using CUDA.jl on an HPC cluster, configured to use the system CUDA installation via Preferences.jl.

When I precompile on one node and then load the package on another node, I occasionally get the following error:

ERROR: LoadError: InitError: UndefVarError: libcuda not defined
  [1] getproperty
    @ ./Base.jl:31 [inlined]
  [2] __init__()
    @ CUDA /central/scratch/esm/slurm-buildkite/climacore-ci/1859/depot/default/packages/CUDA/ZdCxS/src/initialization.jl:42
  [3] _include_from_serialized(pkg::Base.PkgId, path::String, depmods::Vector{Any})
    @ Base ./loading.jl:831
  [4] _tryrequire_from_serialized(modkey::Base.PkgId, path::String, sourcepath::String, depmods::Vector{Any})
    @ Base ./loading.jl:938
  [5] _require_search_from_serialized(pkg::Base.PkgId, sourcepath::String, build_id::UInt64)
    @ Base ./loading.jl:1028
  [6] _require(pkg::Base.PkgId)
    @ Base ./loading.jl:1315
  [7] _require_prelocked(uuidkey::Base.PkgId)
    @ Base ./loading.jl:1200
  [8] macro expansion
    @ ./loading.jl:1180 [inlined]
  [9] macro expansion
    @ ./lock.jl:223 [inlined]
 [10] require(into::Module, mod::Symbol)
    @ Base ./loading.jl:1144
during initialization of module CUDA

e.g. https://buildkite.com/clima/climacore-ci/builds/1859#0186dc49-e196-4714-9bd2-ac2f4ad07ac9/137-139

If I load it at the REPL, I get isdefined(CUDA_Driver_jll, :libcuda) == false.

It goes away if I delete the .ji file for CUDA_Driver_jll and reload.

To reproduce

I'm still not sure exactly how to reproduce it; I will see if I can figure it out.

simonbyrne commented 1 year ago

I'm also seeing it without the CUDA_Runtime_jll preferences set.

vchuravy commented 1 year ago

So CUDA_Driver_jll initializes the cuda variable during init:


vchuravy commented 1 year ago

If you can reproduce it with JULIA_DEBUG="CUDA_Driver_jll" that would be great.

simonbyrne commented 1 year ago

When re-running it:

julia> using CUDA
┌ Debug: No system CUDA driver found
└ @ CUDA_Driver_jll /central/scratch/esm/slurm-buildkite/climacore-ci/1862/depot/default/packages/CUDA_Driver_jll/9E4Mc/src/wrappers/x86_64-linux-gnu.jl:54
ERROR: InitError: UndefVarError: libcuda not defined
 [1] getproperty
   @ ./Base.jl:31 [inlined]
 [2] __init__()
   @ CUDA /central/scratch/esm/slurm-buildkite/climacore-ci/1862/depot/default/packages/CUDA/ZdCxS/src/initialization.jl:42
 [3] _include_from_serialized(pkg::Base.PkgId, path::String, depmods::Vector{Any})
   @ Base ./loading.jl:831
 [4] _require_search_from_serialized(pkg::Base.PkgId, sourcepath::String, build_id::UInt64)
   @ Base ./loading.jl:1039
 [5] _require(pkg::Base.PkgId)
   @ Base ./loading.jl:1315
 [6] _require_prelocked(uuidkey::Base.PkgId)
   @ Base ./loading.jl:1200
 [7] macro expansion
   @ ./loading.jl:1180 [inlined]
 [8] macro expansion
   @ ./lock.jl:223 [inlined]
 [9] require(into::Module, mod::Symbol)
   @ Base ./loading.jl:1144
during initialization of module CUDA

Would it be more robust to define it initially to make it always defined? i.e. change the above linked line to

global libcuda = nothing

maleadt commented 1 year ago

There's an isdefined check in there so I'm not sure why the getproperty fails: https://github.com/JuliaGPU/CUDA.jl/blob/940d23d5b9a82e50f79a16ea46d13ca885a4d2de/src/initialization.jl#L38-L47

simonbyrne commented 1 year ago

I've managed to reliably recreate it, using the default configuration (i.e. it is unrelated to using the system CUDA runtime).

  1. Instantiate and precompile on a node with a GPU:

    ┌─[4]──[Tue Mar 14]──[10:07:53]────────────────────────────────────────
    │ spjbyrne@hpc-21-18:~/misc/cudax
    ├ julia --project
                   _
       _       _ _(_)_     |  Documentation: https://docs.julialang.org
      (_)     | (_) (_)    |
       _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
      | | | | | | |/ _` |  |
      | | |_| | | | (_| |  |  Version 1.8.5 (2023-01-08)
     _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
    |__/                   |

    (cudax) pkg> instantiate
    Precompiling project...
      2 dependencies successfully precompiled in 51 seconds. 35 already precompiled.

    julia> using CUDA

    julia> using CUDA_Driver_jll

    julia> CUDA_Driver_jll.is_available()
    true

    julia> CUDA_Driver_jll.libcuda
    "/home/spjbyrne/.julia/artifacts/b5e755e06f4d49a5ab1a638eea5d75bf20c66e3d/lib/libcuda.so"

  2. On a node without a GPU, attempt to load CUDA.jl:

    ┌─[6]──[Tue Mar 14]──[10:16:23]────────────────────────────────────────
    │ spjbyrne@hpc-21-14:~/misc/cudax
    ├ julia --project
                   _
       _       _ _(_)_     |  Documentation: https://docs.julialang.org
      (_)     | (_) (_)    |
       _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
      | | | | | | |/ _` |  |
      | | |_| | | | (_| |  |  Version 1.8.5 (2023-01-08)
     _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
    |__/                   |

    julia> using CUDA
    ERROR: InitError: UndefVarError: libcuda not defined
    Stacktrace:
     [1] getproperty
       @ ./Base.jl:31 [inlined]
     [2] __init__()
       @ CUDA ~/.julia/packages/CUDA/ZdCxS/src/initialization.jl:42
     [3] _include_from_serialized(pkg::Base.PkgId, path::String, depmods::Vector{Any})
       @ Base ./loading.jl:831
     [4] _require_search_from_serialized(pkg::Base.PkgId, sourcepath::String, build_id::UInt64)
       @ Base ./loading.jl:1039
     [5] _require(pkg::Base.PkgId)
       @ Base ./loading.jl:1315
     [6] _require_prelocked(uuidkey::Base.PkgId)
       @ Base ./loading.jl:1200
     [7] macro expansion
       @ ./loading.jl:1180 [inlined]
     [8] macro expansion
       @ ./lock.jl:223 [inlined]
     [9] require(into::Module, mod::Symbol)
       @ Base ./loading.jl:1144
    during initialization of module CUDA

    julia> using CUDA_Driver_jll

    julia> CUDA_Driver_jll.is_available()
    true

    julia> CUDA_Driver_jll.libcuda
    ERROR: UndefVarError: libcuda not defined
    Stacktrace:
     [1] getproperty(x::Module, f::Symbol)
       @ Base ./Base.jl:31
     [2] top-level scope
       @ REPL[4]:1

    julia> isdefined(CUDA_Driver_jll, :libcuda)
    false

My guess is that the result of `isdefined(CUDA_Driver_jll, :libcuda)` is incorrectly assumed to be constant during precompilation?
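
To make the two approaches mentioned in this thread concrete, here is a minimal sketch contrasting a global that is only created when a driver is found (so readers need an `isdefined` guard) with a global that is always defined and holds `nothing` when no driver is found. The module names `LazyDefined` and `AlwaysDefined`, the `setup` function, and the library path are all hypothetical stand-ins, not the actual CUDA_Driver_jll code, and the sketch does not reproduce the precompilation problem itself.

```julia
# Hypothetical stand-ins; neither module is the real CUDA_Driver_jll.

module LazyDefined
# Pattern A: the global is only assigned if a driver is found.
function setup(found::Bool)
    if found
        global libcuda = "/hypothetical/path/libcuda.so"
    end
    return nothing
end
end

module AlwaysDefined
# Pattern B: the global always exists; `nothing` means "no driver found".
libcuda = nothing
function setup(found::Bool)
    global libcuda = found ? "/hypothetical/path/libcuda.so" : nothing
    return nothing
end
end

# Simulate initialization on a node without a GPU.
LazyDefined.setup(false)
AlwaysDefined.setup(false)

# Pattern A: every reader has to guard the access with isdefined.
lazy_path = isdefined(LazyDefined, :libcuda) ? LazyDefined.libcuda : nothing

# Pattern B: the access itself cannot throw UndefVarError.
always_path = AlwaysDefined.libcuda

println(lazy_path === nothing)
println(always_path === nothing)
```

With pattern B, a consumer can check `libcuda === nothing` instead of relying on `isdefined`, which would sidestep the question of whether the guard is evaluated at precompile time or at load time.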


maleadt commented 1 year ago

So the Nu66l version slug used above is from CUDA_Runtime_jll 0.5. Even though that package depends on CUDA_Driver_jll 0.5, apparently it gets loaded during upgrades at a time when only CUDA_Driver_jll 0.4 is available. That's very annoying, and essentially means we can't rely on compatibility bounds for package augmentation hooks...

maleadt commented 1 year ago

https://github.com/JuliaRegistries/General/pull/81742 should fix this, hopefully