Open mkschleg opened 1 year ago
That's a strange error I haven't encountered. I take it you also don't have an NVIDIA driver on that system? In that case, it shouldn't even attempt to load a CUDNN library. Can you show the output of `cuDNN.CUDNN_jll.host_platform`? Running with `LD_DEBUG=libs` might also reveal something.
This is a cluster, so the NVIDIA driver technically exists, depending on how modules are handled. I also have them in my path because of Python dependency weirdness in other projects.
julia> cuDNN.CUDNN_jll.host_platform
Linux x86_64 {cuda=11.8, cxxstring_abi=cxx11, julia_version=1.8.5, libc=glibc, libgfortran_version=5.0.0, libstdcxx_version=3.4.28}
I'm not sure what I'm looking for with `LD_DEBUG=libs`, but this seems to be the relevant bit.
Please include the full `LD_DEBUG` output; what you posted doesn't show which `libcuda.so` got loaded. Also please post a log with `LD_DEBUG=libs` of the situation where cuDNN did load correctly (i.e. with it part of the environment and loaded first), so that we can compare.
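For anyone following along, here is a rough sketch of one way to capture the two logs being asked for. `LD_DEBUG` and `LD_DEBUG_OUTPUT` are glibc loader variables; the helper function names and the `/tmp/...` prefixes are made up for illustration, and the sketch assumes `julia` is on the `PATH` with Flux and cuDNN installed in the active environment.

```shell
# Run a command with the glibc loader's library trace enabled.
# The trace is written to "<prefix>.<pid>" rather than to the terminal.
capture_ld_debug() {
    prefix="$1"; shift
    LD_DEBUG=libs LD_DEBUG_OUTPUT="$prefix" "$@"
}

# Keep only the trace lines that mention the CUDA driver or cuDNN libraries.
filter_cuda_libs() {
    grep -E 'libcuda|libcudnn' "$@"
}

# Example usage (hypothetical prefixes, not run here):
#   capture_ld_debug /tmp/flux_only   julia -e 'using Flux'
#   capture_ld_debug /tmp/cudnn_first julia -e 'using cuDNN; using Flux'
#   filter_cuda_libs /tmp/flux_only.*   > flux_only.txt
#   filter_cuda_libs /tmp/cudnn_first.* > cudnn_first.txt
#   diff flux_only.txt cudnn_first.txt
```

The filtered files are only a convenience for spotting which `libcuda.so` and `libcudnn*` paths were picked up in each case; the full, unfiltered logs are still what should be attached to the issue.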
OK, here we go!
I just realized I misunderstood the request for the working case. Here is the log when I load cuDNN first and then Flux (which doesn't throw an exception).
Describe the bug
When loading Flux >=0.13.14 there is an init error while loading cuDNN (see the related Flux issue: https://github.com/FluxML/Flux.jl/issues/2232). The error comes from loading the `libcudnn_cnn_infer.so` artifact. When we add cuDNN to the `Project.toml` and then load cuDNN first and Flux second, the issue goes away.
One other behavior I've noticed: after removing cuDNN (which only removes it from the Project) and then adding it back in a new Julia session, the first load also produces the init error. Once it has happened once, though, it does not recur, even in new sessions.
We are at a loss for debugging in Flux, so hopefully we can get some pointers here!
To reproduce
The Minimal Working Example (MWE) for this bug:
See above:
Status of Manifest
```
Status `~/.julia/environments/v1.8/Manifest.toml`
  [621f4979] AbstractFFTs v1.3.1
  [7d9f7c33] Accessors v0.1.29
  [79e6a3ab] Adapt v3.6.1
  [dce04be8] ArgCheck v2.3.0
  [a9b6321e] Atomix v0.1.0
  [ab4f0b2a] BFloat16s v0.4.2
  [198e06fe] BangBang v0.3.37
  [9718e550] Baselet v0.1.1
  [fa961155] CEnum v0.4.2
  [052768ef] CUDA v4.2.0
  [1af6417a] CUDA_Runtime_Discovery v0.2.2
  [082447d4] ChainRules v1.49.0
  [d360d2e6] ChainRulesCore v1.15.7
  [9e997f8a] ChangesOfVariables v0.1.7
  [bbf7d656] CommonSubexpressions v0.3.0
  [34da2185] Compat v4.6.1
  [a33af91c] CompositionsBase v0.1.1
  [187b0558] ConstructionBase v1.5.1
  [6add18c4] ContextVariablesX v0.1.3
  [9a962f9c] DataAPI v1.14.0
  [864edb3b] DataStructures v0.18.13
  [e2d170a0] DataValueInterfaces v1.0.0
  [244e2a9f] DefineSingletons v0.1.2
  [163ba53b] DiffResults v1.1.0
  [b552c78f] DiffRules v1.13.0
  [ffbed154] DocStringExtensions v0.9.3
  [e2ba6199] ExprTools v0.1.9
  [cc61a311] FLoops v0.2.1
  [b9860ae5] FLoopsBase v0.1.1
  [1a297f60] FillArrays v1.0.0
⌃ [587475ba] Flux v0.13.14
  [9c68100b] FoldsThreads v0.1.1
  [f6369f11] ForwardDiff v0.10.35
  [069b7b12] FunctionWrappers v1.1.3
  [d9f16b24] Functors v0.4.4
  [0c68f7d7] GPUArrays v8.6.6
  [46192b85] GPUArraysCore v0.1.4
  [61eb1bfa] GPUCompiler v0.19.3
  [7869d1d1] IRTools v0.4.9
  [22cec73e] InitialValues v0.3.1
  [3587e190] InverseFunctions v0.1.9
  [92d709cd] IrrationalConstants v0.2.2
  [82899510] IteratorInterfaceExtensions v1.0.0
  [692b3bcd] JLLWrappers v1.4.1
  [b14d175d] JuliaVariables v0.2.4
  [63c18a36] KernelAbstractions v0.9.4
  [929cbde3] LLVM v5.0.0
  [2ab3a3ac] LogExpFunctions v0.3.23
  [d8e11817] MLStyle v0.4.17
  [f1d291b0] MLUtils v0.4.2
  [1914dd2f] MacroTools v0.5.10
  [128add7d] MicroCollections v0.1.4
  [e1d29d7a] Missings v1.1.0
  [872c559c] NNlib v0.8.20
  [a00861dc] NNlibCUDA v0.2.7
  [77ba4419] NaNMath v1.0.2
  [71a1bf82] NameResolution v0.1.5
  [0b1bfda6] OneHotArrays v0.2.3
  [3bd65402] Optimisers v0.2.18
  [bac558e1] OrderedCollections v1.6.0
  [aea7be01] PrecompileTools v1.0.2
  [21216c6a] Preferences v1.3.0
  [8162dcfd] PrettyPrint v0.2.0
  [33c8b6b6] ProgressLogging v0.1.4
  [74087812] Random123 v1.6.1
  [e6cf234a] RandomNumbers v1.5.3
  [c1ae055f] RealDot v0.1.0
  [189a3867] Reexport v1.2.2
  [ae029012] Requires v1.3.0
  [6c6a2e73] Scratch v1.2.0
  [efcf1570] Setfield v1.1.1
  [605ecd9f] ShowCases v0.1.0
  [699a6c99] SimpleTraits v0.9.4
  [66db9d55] SnoopPrecompile v1.0.3
  [a2af1166] SortingAlgorithms v1.1.0
  [276daf66] SpecialFunctions v2.2.0
  [171d559e] SplittablesBase v0.1.15
  [90137ffa] StaticArrays v1.5.24
  [1e83bf80] StaticArraysCore v1.4.0
  [82ae8749] StatsAPI v1.6.0
⌅ [2913bbd2] StatsBase v0.33.21
  [09ab397b] StructArrays v0.6.15
  [3783bdb8] TableTraits v1.0.1
  [bd369af6] Tables v1.10.1
  [a759f4b9] TimerOutputs v0.5.23
  [28d57a85] Transducers v0.4.75
  [013be700] UnsafeAtomics v0.2.1
  [d80eeb9a] UnsafeAtomicsLLVM v0.1.2
  [e88e6eb3] Zygote v0.6.60
  [700de1a5] ZygoteRules v0.2.3
  [02a925ec] cuDNN v1.0.3
  [4ee394cb] CUDA_Driver_jll v0.5.0+1
  [76a88914] CUDA_Runtime_jll v0.6.0+0
  [62b44479] CUDNN_jll v8.8.1+0
  [dad2f222] LLVMExtra_jll v0.0.21+0
  [efe28fd5] OpenSpecFun_jll v0.5.5+0
  [0dad84c5] ArgTools v1.1.1
  [56f22d72] Artifacts
  [2a0f44e3] Base64
  [ade2ca70] Dates
  [8bb1440f] DelimitedFiles
  [8ba89e20] Distributed
  [f43a241f] Downloads v1.6.0
  [7b1f6079] FileWatching
  [9fa8497b] Future
  [b77e0a4c] InteractiveUtils
  [4af54fe1] LazyArtifacts
  [b27032c2] LibCURL v0.6.3
  [76f85450] LibGit2
  [8f399da3] Libdl
  [37e2e46d] LinearAlgebra
  [56ddb016] Logging
  [d6f4376e] Markdown
  [a63ad114] Mmap
  [ca575930] NetworkOptions v1.2.0
  [44cfe95a] Pkg v1.8.0
  [de0858da] Printf
  [3fa0cd96] REPL
  [9a3f8284] Random
  [ea8e919c] SHA v0.7.0
  [9e88b42a] Serialization
  [6462fe0b] Sockets
  [2f01184e] SparseArrays
  [10745b16] Statistics
  [fa267f1f] TOML v1.0.0
  [a4e569a6] Tar v1.10.1
  [8dfed614] Test
  [cf7118a7] UUIDs
  [4ec0a83e] Unicode
  [e66e0078] CompilerSupportLibraries_jll v1.0.1+0
  [deac9b47] LibCURL_jll v7.84.0+0
  [29816b5a] LibSSH2_jll v1.10.2+0
  [c8ffd9c3] MbedTLS_jll v2.28.0+0
  [14a3606d] MozillaCACerts_jll v2022.2.1
  [4536629a] OpenBLAS_jll v0.3.20+0
  [05823500] OpenLibm_jll v0.8.1+0
  [83775a58] Zlib_jll v1.2.12+3
  [8e850b90] libblastrampoline_jll v5.1.1+0
  [8e850ede] nghttp2_jll v1.48.0+0
  [3f19e933] p7zip_jll v17.4.0+0
```
Expected behavior
Flux loads with a warning that no CUDA device is detected.
Version info
Details on Julia:
Details on CUDA:
Additional context
There is no CUDA device on the machine discussed.