JuliaGPU / CUDA.jl

CUDA programming in Julia.
https://juliagpu.org/cuda/

`Invalid instruction` error when `using CUDA` #2454

Closed tomchor closed 2 months ago

tomchor commented 2 months ago

Describe the bug

Most of the time things work fine, but once in a while I get an error on the line `using CUDA`. Once that has happened, I cannot use CUDA again until I delete everything in `$JULIA_DEPOT_PATH` and re-instantiate the environment from scratch.

The start of the error is:

Invalid instruction at 0x1521aea9e2d8: 0x62, 0xf1, 0x7d, 0x08, 0x76, 0xc1, 0xc5, 0xf8, 0x44, 0xc0, 0xc5, 0xfb, 0x93, 0xc0, 0xa8

[118481] signal (4.2): Illegal instruction
in expression starting at /glade/derecho/scratch/tomasc/test_CUDA2/front.jl:1
__init__ at /glade/work/tomasc/.julia6/packages/LLVM/5aiiG/src/LLVM.jl:103
jfptr___init___927 at /glade/work/tomasc/.julia6/compiled/v1.9/LLVM/e8NBy_8n5Ta.so (unknown line)
_jl_invoke at /glade/derecho/scratch/csgteam/temp/spack/casper/23.10/builds/spack-stage-julia-1.9.2-mjeadxih745lj3s24lbol2ou7lpwqtse/spack-src/src/gf.c:2758 [inlined]
ijl_apply_generic at /glade/derecho/scratch/csgteam/temp/spack/casper/23.10/builds/spack-stage-julia-1.9.2-mjeadxih745lj3s24lbol2ou7lpwqtse/spack-src/src/gf.c:2940
jl_apply at /glade/derecho/scratch/csgteam/temp/spack/casper/23.10/builds/spack-stage-julia-1.9.2-mjeadxih745lj3s24lbol2ou7lpwqtse/spack-src/src/julia.h:1879 [inlined]
jl_module_run_initializer at /glade/derecho/scratch/csgteam/temp/spack/casper/23.10/builds/spack-stage-julia-1.9.2-mjeadxih745lj3s24lbol2ou7lpwqtse/spack-src/src/toplevel.c:75
ijl_init_restored_modules at /glade/derecho/scratch/csgteam/temp/spack/casper/23.10/builds/spack-stage-julia-1.9.2-mjeadxih745lj3s24lbol2ou7lpwqtse/spack-src/src/module.c:982
register_restored_modules at ./loading.jl:1115
_include_from_serialized at ./loading.jl:1061
_tryrequire_from_serialized at ./loading.jl:1391
_require_search_from_serialized at ./loading.jl:1494
_require at ./loading.jl:1783
_require_prelocked at ./loading.jl:1660
macro expansion at ./loading.jl:1648 [inlined]
macro expansion at ./lock.jl:267 [inlined]
require at ./loading.jl:1611

To reproduce

The Minimal Working Example (MWE) for this bug:

using CUDA
Manifest.toml

Below is the part of Manifest.toml related to CUDA.jl, GPUArrays.jl, GPUCompiler.jl, and LLVM.jl:

```
[[deps.CUDA]]
deps = ["AbstractFFTs", "Adapt", "BFloat16s", "CEnum", "CUDA_Driver_jll", "CUDA_Runtime_Discovery", "CUDA_Runtime_jll", "CompilerSupportLibraries_jll", "ExprTools", "GPUArrays", "GPUCompiler", "KernelAbstractions", "LLVM", "LazyArtifacts", "Libdl", "LinearAlgebra", "Logging", "Preferences", "Printf", "Random", "Random123", "RandomNumbers", "Reexport", "Requires", "SparseArrays", "SpecialFunctions", "UnsafeAtomicsLLVM"]
git-tree-sha1 = "442d989978ed3ff4e174c928ee879dc09d1ef693"
uuid = "052768ef-5323-5732-b1bb-66c8b64840ba"
version = "4.3.2"

[[deps.CUDA_Driver_jll]]
deps = ["Artifacts", "JLLWrappers", "LazyArtifacts", "Libdl", "Pkg"]
git-tree-sha1 = "498f45593f6ddc0adff64a9310bb6710e851781b"
uuid = "4ee394cb-3365-5eb0-8335-949819d2adfc"
version = "0.5.0+1"

[[deps.CUDA_Runtime_Discovery]]
deps = ["Libdl"]
git-tree-sha1 = "bcc4a23cbbd99c8535a5318455dcf0f2546ec536"
uuid = "1af6417a-86b4-443c-805f-a4643ffb695f"
version = "0.2.2"

[[deps.CUDA_Runtime_jll]]
deps = ["Artifacts", "CUDA_Driver_jll", "JLLWrappers", "LazyArtifacts", "Libdl", "TOML"]
git-tree-sha1 = "5248d9c45712e51e27ba9b30eebec65658c6ce29"
uuid = "76a88914-d11a-5bdc-97e0-2f5a05c973a2"
version = "0.6.0+0"

[[deps.GPUArrays]]
deps = ["Adapt", "GPUArraysCore", "LLVM", "LinearAlgebra", "Printf", "Random", "Reexport", "Serialization", "Statistics"]
git-tree-sha1 = "a3351bc577a6b49297248aadc23a4add1097c2ac"
uuid = "0c68f7d7-f131-5f86-a1c3-88cf8149b2d7"
version = "8.7.1"

[[deps.GPUArraysCore]]
deps = ["Adapt"]
git-tree-sha1 = "2d6ca471a6c7b536127afccfa7564b5b39227fe0"
uuid = "46192b85-c4d5-4398-a991-12ede77f4527"
version = "0.1.5"

[[deps.GPUCompiler]]
deps = ["ExprTools", "InteractiveUtils", "LLVM", "Libdl", "Logging", "Scratch", "TimerOutputs", "UUIDs"]
git-tree-sha1 = "cb090aea21c6ca78d59672a7e7d13bd56d09de64"
uuid = "61eb1bfa-7361-4325-ad38-22787b887f55"
version = "0.20.3"

[[deps.LLVM]]
deps = ["CEnum", "LLVMExtra_jll", "Libdl", "Printf", "Unicode"]
git-tree-sha1 = "5007c1421563108110bbd57f63d8ad4565808818"
uuid = "929cbde3-209d-540e-8aea-75f648917ca0"
version = "5.2.0"

[[deps.LLVMExtra_jll]]
deps = ["Artifacts", "JLLWrappers", "LazyArtifacts", "Libdl", "TOML"]
git-tree-sha1 = "1222116d7313cdefecf3d45a2bc1a89c4e7c9217"
uuid = "dad2f222-ce93-54a1-a47d-0025e8a3acab"
version = "0.0.22+0"

[[deps.LLVMOpenMP_jll]]
deps = ["Artifacts", "JLLWrappers", "Libdl", "Pkg"]
git-tree-sha1 = "f689897ccbe049adb19a065c495e75f372ecd42b"
uuid = "1d63c593-3942-5779-bab2-d838dc0a180e"
version = "15.0.4+0"
```

Version info

Details on Julia:

Julia Version 1.9.2
Commit e4ee485e90 (2023-07-05 09:39 UTC)
Platform Info:
  OS: Linux (x86_64-suse-linux)
  CPU: 72 × Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, cascadelake)
  Threads: 1 on 72 virtual cores
Environment:
  LD_LIBRARY_PATH = /glade/u/apps/common/23.08/spack/opt/spack/cuda/12.2.1/lib64:/glade/u/apps/common/23.08/spack/opt/spack/cuda/12.2.1/nvvm/lib64:/glade/u/apps/common/23.08/spack/opt/spack/cuda/12.2.1/extras/CUPTI/lib64:/glade/u/apps/common/23.08/spack/opt/spack/cuda/12.2.1/extras/Debugger/lib64:/glade/u/apps/casper/23.10/spack/opt/spack/openmpi/4.1.6/oneapi/2023.2.1/dgcv/lib:/glade/u/apps/common/23.08/spack/opt/spack/intel-oneapi-compilers/2023.2.1/compiler/2023.2.1/linux/lib:/glade/u/apps/common/23.08/spack/opt/spack/intel-oneapi-compilers/2023.2.1/compiler/2023.2.1/linux/lib/x64:/glade/u/apps/common/23.08/spack/opt/spack/intel-oneapi-compilers/2023.2.1/compiler/2023.2.1/linux/lib/oclfpga/host/linux64/lib:/glade/u/apps/common/23.08/spack/opt/spack/intel-oneapi-compilers/2023.2.1/compiler/2023.2.1/linux/compiler/lib/intel64_lin:/glade/u/apps/casper/23.10/spack/opt/spack/hdf5/1.12.2/oneapi/2023.2.1/6vf2/lib
  JULIA_DEPOT_PATH = /glade/work/tomasc/.julia
  JULIA_EDITOR = vim

Details on CUDA:

Unfortunately I can't get it:

julia> CUDA.versioninfo()
ERROR: CUDA initialization failed
Stacktrace:
 [1] error(s::String)
   @ Base ./error.jl:35
 [2] functional
   @ /glade/work/tomasc/.julia/packages/CUDA/pCcGc/src/initialization.jl:24 [inlined]
 [3] versioninfo(io::Base.TTY)
   @ CUDA /glade/work/tomasc/.julia/packages/CUDA/pCcGc/src/utilities.jl:32
 [4] versioninfo()
   @ CUDA /glade/work/tomasc/.julia/packages/CUDA/pCcGc/src/utilities.jl:32
 [5] top-level scope
   @ REPL[3]:1

Additional context

Technically I can keep recompiling my whole Julia environment every time this error happens, but I'd really like to avoid that: the environment is complex and recompiling takes a long time. Also, compatibility issues with the machine and other software I'm using prevent me from moving to the latest CUDA version or to Julia 1.10.

CC @loganpknudsen

maleadt commented 2 months ago

Illegal instruction errors are typically caused by bugs in Julia itself, and not in CUDA.jl. I would recommend trying out an assertions build, which may reveal additional information.

As this also seems to happen during initialization of LLVM.jl, can you try just `using LLVM` and see if that reproduces the issue?
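A minimal sketch of that check, assuming a POSIX shell and that `julia` on the `PATH` is the same binary and environment that crashed. The `try_load` function and the `JULIA` override variable are hypothetical names used for illustration; they are not part of Julia, LLVM.jl, or CUDA.jl.

```shell
# Hypothetical helper: load one package in a fresh Julia process and
# report whether the process exited cleanly.
# Set JULIA to point at a specific julia binary (assumption: defaults to `julia`).
try_load() {
    pkg="$1"
    if "${JULIA:-julia}" --startup-file=no -e "using $pkg"; then
        echo "$pkg: loaded"
    else
        rc=$?
        echo "$pkg: failed (exit status $rc)"
        return "$rc"
    fi
}

# Usage: load LLVM on its own first, then CUDA.
# try_load LLVM && try_load CUDA
```

If `try_load LLVM` fails by itself, that points away from CUDA.jl. A process killed by an illegal-instruction signal (signal 4, as in the trace above) is usually reported by the shell as exit status 132.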

> Details on CUDA:
>
> Unfortunately I can't get it:
>
> julia> CUDA.versioninfo()
> ERROR: CUDA initialization failed

I'm confused here; does this mean CUDA.jl never works?

tomchor commented 2 months ago

> Illegal instruction errors are typically caused by bugs in Julia itself, and not in CUDA.jl. I would recommend trying out an assertions build, which may reveal additional information.

I'll look into that, but I haven't changed anything in my Julia install, and it used to work. So I'm really at a loss here.

> As this also seems to happen during initialization of LLVM.jl, can you try just `using LLVM` and see if that reproduces the issue?

I'll try that out soon and post results.

> > Details on CUDA: Unfortunately I can't get it:
> >
> > julia> CUDA.versioninfo()
> > ERROR: CUDA initialization failed
>
> I'm confused here; does this mean CUDA.jl never works?

Just to clarify, I can get CUDA to work if I remove everything from `$JULIA_DEPOT_PATH` and re-instantiate, but for some reason even then I get that error when calling `CUDA.versioninfo()`. Not sure why.

maleadt commented 2 months ago

> I can get CUDA to work if I remove everything from `$JULIA_DEPOT_PATH` and re-instantiate, but for some reason even then I get that error when calling `CUDA.versioninfo()`. Not sure why.

That error indicates that `cuInit` failed, so I have a hard time understanding how anything else in CUDA.jl would work in that case. Please check `dmesg`; there might be an NVIDIA driver-related error reported there.
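One way to narrow down the kernel log, sketched under the assumption of a POSIX shell with `grep` available. The `nvidia_lines` name is hypothetical, and the pattern is only a starting point: the NVIDIA kernel driver commonly prefixes its messages with `NVRM` and reports GPU faults as `Xid` events, but other wording is possible.

```shell
# Hypothetical filter: keep only log lines that mention the NVIDIA driver.
# Reads from standard input so it works on dmesg output or a saved log file.
nvidia_lines() {
    grep -i -E 'nvrm|nvidia|xid'
}

# Usage (reading the kernel log may need elevated privileges on some systems):
# dmesg | nvidia_lines
```

On a shared cluster, `dmesg` may be restricted for regular users, in which case the system administrators would have to run this on the affected node.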

tomchor commented 2 months ago

I'll close this because I tried running this again today (after re-compiling everything, which is something I had done in the past) and for some reason things are now working. I didn't do anything different from some previous attempts so my best guess is that the system admin changed something relevant.

Thanks for the help, @maleadt!

maleadt commented 2 months ago

Thanks for the update!