Closed tomchor closed 2 months ago
Illegal instruction errors are typically caused by bugs in Julia itself, and not in CUDA.jl. I would recommend trying out an assertions build, which may reveal additional information.
As this also seems to happen during initialization of LLVM.jl, can you try just `using LLVM` and see if that reproduces the issue?
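One way to run that isolation test is to load a single package in a fresh Julia process and look at how the process exits. The sketch below is only an illustration: `try_load` and the `JULIA_BIN` override are hypothetical names, and it assumes a Linux shell, where a child killed by SIGILL is reported as exit status 132.

```shell
# Hypothetical helper: load one package in a fresh Julia process and report
# how that process ended. JULIA_BIN is an assumed override for the Julia
# executable and defaults to `julia`.
try_load() {
    pkg="$1"
    "${JULIA_BIN:-julia}" --startup-file=no -e "using $pkg"
    status=$?
    if [ "$status" -eq 0 ]; then
        echo "$pkg: loaded"
    elif [ "$status" -eq 132 ]; then
        # 128 + 4: how the shell reports a child killed by SIGILL on Linux
        echo "$pkg: illegal instruction"
    else
        echo "$pkg: failed with exit status $status"
    fi
    return "$status"
}

# Usage: try_load LLVM && try_load CUDA
```

If `try_load LLVM` already reports an illegal instruction, the problem is likely upstream of CUDA.jl. To test inside the affected environment, the call would also need that environment's project activated, for example through `JULIA_PROJECT`.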
Details on CUDA:
Unfortunately I can't get it:
julia> CUDA.versioninfo()
ERROR: CUDA initialization failed
I'm confused here; does this mean CUDA.jl never works?
Illegal instruction errors are typically caused by bugs in Julia itself, and not in CUDA.jl. I would recommend trying out an assertions build, which may reveal additional information.
I'll look into that, but I haven't changed anything on my Julia install or anything, and it used to work. So I'm really at a loss here.
As this also seems to happen during initialization of LLVM.jl, can you try just `using LLVM` and see if that reproduces the issue?
I'll try that out soon and post results.
Details on CUDA: Unfortunately I can't get it:
julia> CUDA.versioninfo()
ERROR: CUDA initialization failed
I'm confused here; does this mean CUDA.jl never works?
Just to clarify, I can get CUDA to work if I remove everything from `$JULIA_DEPOT_PATH` and reinstantiate, but for some reason even with that I get that error when trying out `CUDA.versioninfo()`. Not sure why.
I can get CUDA to work if I remove everything from `$JULIA_DEPOT_PATH` and reinstantiate, but for some reason even with that I get that error when trying out `CUDA.versioninfo()`. Not sure why.
That error indicates `cuInit` failed, so I have a hard time understanding how anything else in CUDA.jl would work in that case. Please check `dmesg`; there might be an NVIDIA-driver-related error reported in there.
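The kernel log can be long, so it may help to keep only the lines that mention the NVIDIA driver. A minimal sketch, assuming the driver tags its messages with `NVRM` or `nvidia`; `nvidia_log_lines` is a hypothetical helper name.

```shell
# Hypothetical filter: keep kernel log lines that mention the NVIDIA driver.
# Reads from stdin; `|| true` keeps the exit status clean when nothing matches.
nvidia_log_lines() {
    grep -iE 'nvrm|nvidia' || true
}

# Usage: dmesg | nvidia_log_lines | tail -n 20
```

On some systems reading `dmesg` requires elevated privileges, so on a shared cluster this may need the system administrator.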
I'll close this because I tried running this again today (after re-compiling everything, which is something I had done in the past) and for some reason things are now working. I didn't do anything different from some previous attempts so my best guess is that the system admin changed something relevant.
Thanks for the help, @maleadt!
Thanks for the update!
Describe the bug
Most of the time things work fine. Once in a while I'll get an error on the line `using CUDA`. After that happens once, I cannot use CUDA again until I delete everything in `$JULIA_DEPOT_PATH` and re-instantiate everything from scratch.
The start of the error is:
To reproduce
The Minimal Working Example (MWE) for this bug:
Manifest.toml
Below are the parts of Manifest.toml related to CUDA.jl, GPUArrays.jl, GPUCompiler.jl, and LLVM.jl:
```
[[deps.CUDA]]
deps = ["AbstractFFTs", "Adapt", "BFloat16s", "CEnum", "CUDA_Driver_jll", "CUDA_Runtime_Discovery", "CUDA_Runtime_jll", "CompilerSupportLibraries_jll", "ExprTools", "GPUArrays", "GPUCompiler", "KernelAbstractions", "LLVM", "LazyArtifacts", "Libdl", "LinearAlgebra", "Logging", "Preferences", "Printf", "Random", "Random123", "RandomNumbers", "Reexport", "Requires", "SparseArrays", "SpecialFunctions", "UnsafeAtomicsLLVM"]
git-tree-sha1 = "442d989978ed3ff4e174c928ee879dc09d1ef693"
uuid = "052768ef-5323-5732-b1bb-66c8b64840ba"
version = "4.3.2"

[[deps.CUDA_Driver_jll]]
deps = ["Artifacts", "JLLWrappers", "LazyArtifacts", "Libdl", "Pkg"]
git-tree-sha1 = "498f45593f6ddc0adff64a9310bb6710e851781b"
uuid = "4ee394cb-3365-5eb0-8335-949819d2adfc"
version = "0.5.0+1"

[[deps.CUDA_Runtime_Discovery]]
deps = ["Libdl"]
git-tree-sha1 = "bcc4a23cbbd99c8535a5318455dcf0f2546ec536"
uuid = "1af6417a-86b4-443c-805f-a4643ffb695f"
version = "0.2.2"

[[deps.CUDA_Runtime_jll]]
deps = ["Artifacts", "CUDA_Driver_jll", "JLLWrappers", "LazyArtifacts", "Libdl", "TOML"]
git-tree-sha1 = "5248d9c45712e51e27ba9b30eebec65658c6ce29"
uuid = "76a88914-d11a-5bdc-97e0-2f5a05c973a2"
version = "0.6.0+0"

[[deps.GPUArrays]]
deps = ["Adapt", "GPUArraysCore", "LLVM", "LinearAlgebra", "Printf", "Random", "Reexport", "Serialization", "Statistics"]
git-tree-sha1 = "a3351bc577a6b49297248aadc23a4add1097c2ac"
uuid = "0c68f7d7-f131-5f86-a1c3-88cf8149b2d7"
version = "8.7.1"

[[deps.GPUArraysCore]]
deps = ["Adapt"]
git-tree-sha1 = "2d6ca471a6c7b536127afccfa7564b5b39227fe0"
uuid = "46192b85-c4d5-4398-a991-12ede77f4527"
version = "0.1.5"

[[deps.GPUCompiler]]
deps = ["ExprTools", "InteractiveUtils", "LLVM", "Libdl", "Logging", "Scratch", "TimerOutputs", "UUIDs"]
git-tree-sha1 = "cb090aea21c6ca78d59672a7e7d13bd56d09de64"
uuid = "61eb1bfa-7361-4325-ad38-22787b887f55"
version = "0.20.3"

[[deps.LLVM]]
deps = ["CEnum", "LLVMExtra_jll", "Libdl", "Printf", "Unicode"]
git-tree-sha1 = "5007c1421563108110bbd57f63d8ad4565808818"
uuid = "929cbde3-209d-540e-8aea-75f648917ca0"
version = "5.2.0"

[[deps.LLVMExtra_jll]]
deps = ["Artifacts", "JLLWrappers", "LazyArtifacts", "Libdl", "TOML"]
git-tree-sha1 = "1222116d7313cdefecf3d45a2bc1a89c4e7c9217"
uuid = "dad2f222-ce93-54a1-a47d-0025e8a3acab"
version = "0.0.22+0"

[[deps.LLVMOpenMP_jll]]
deps = ["Artifacts", "JLLWrappers", "Libdl", "Pkg"]
git-tree-sha1 = "f689897ccbe049adb19a065c495e75f372ecd42b"
uuid = "1d63c593-3942-5779-bab2-d838dc0a180e"
version = "15.0.4+0"
```
Version info
Details on Julia:
Details on CUDA:
Unfortunately I can't get it:
Additional context
Technically I guess I can continue re-compiling my whole Julia environment every time this error happens, but I'd really want to try and avoid that since the environment is complex and it takes a long time. Another note is that compatibility issues regarding the machine and other software I'm using prevent me from using the latest CUDA version and Julia 1.10.
CC @loganpknudsen