JuliaGPU / CUDA.jl

CUDA programming in Julia.
https://juliagpu.org/cuda/

Segmentation fault when importing CUDA #2083

Closed yuvalwas closed 1 year ago

yuvalwas commented 1 year ago

Describe the bug

Hello, I'm not sure whether you'll consider this a bug. The documentation on conditional use tells users that they can always import CUDA. However, when I import CUDA on a non-GPU server of my institute's HPC cluster, I get:


[4389] signal (11.1): Segmentation fault
in expression starting at /home/labs/tsodyks/yuvalw/clusterless/wexac_utils/env_setup.jl:5
__init__ at /home/labs/tsodyks/yuvalw/.julia/packages/CUDA/ZdCxS/src/initialization.jl:42
jfptr___init___3368 at /home/labs/tsodyks/yuvalw/.julia/compiled/v1.9/CUDA/oWw5k_lNa62.so (unknown line)
_jl_invoke at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2758 [inlined]
ijl_apply_generic at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2940
jl_apply at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/julia.h:1880 [inlined]
jl_module_run_initializer at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/toplevel.c:75
ijl_init_restored_modules at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/module.c:982
register_restored_modules at ./loading.jl:1115
_include_from_serialized at ./loading.jl:1061
_require_search_from_serialized at ./loading.jl:1506
_require at ./loading.jl:1783
_require_prelocked at ./loading.jl:1660
macro expansion at ./loading.jl:1648 [inlined]
macro expansion at ./lock.jl:267 [inlined]
require at ./loading.jl:1611
jfptr_require_45889.clone_1 at /apps/easybd/easybuild/software/Julia/1.9.3-linux-x86_64/lib/julia/sys.so (unknown line)
_jl_invoke at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2758 [inlined]
ijl_apply_generic at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2940
jl_apply at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/julia.h:1880 [inlined]
call_require at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/toplevel.c:466 [inlined]
eval_import_path at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/toplevel.c:503
jl_toplevel_eval_flex at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/toplevel.c:731
eval_body at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/interpreter.c:572
eval_body at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/interpreter.c:533
jl_interpret_toplevel_thunk at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/interpreter.c:762
jl_toplevel_eval_flex at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/toplevel.c:912
jl_toplevel_eval_flex at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/toplevel.c:856
ijl_toplevel_eval_in at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/toplevel.c:971
eval at ./boot.jl:370 [inlined]
include_string at ./loading.jl:1903
_jl_invoke at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2758 [inlined]
ijl_apply_generic at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2940
_include at ./loading.jl:1963
include at ./Base.jl:457
jfptr_include_35036.clone_1 at /apps/easybd/easybuild/software/Julia/1.9.3-linux-x86_64/lib/julia/sys.so (unknown line)
_jl_invoke at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2758 [inlined]
ijl_apply_generic at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2940
exec_options at ./client.jl:307
_start at ./client.jl:522
jfptr__start_40034.clone_1 at /apps/easybd/easybuild/software/Julia/1.9.3-linux-x86_64/lib/julia/sys.so (unknown line)
_jl_invoke at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2758 [inlined]
ijl_apply_generic at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/gf.c:2940
jl_apply at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/julia.h:1880 [inlined]
true_main at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/jlapi.c:573
jl_repl_entrypoint at /cache/build/default-amdci5-5/julialang/julia-release-1-dot-9/src/jlapi.c:717
main at julia (unknown line)
__libc_start_main at /lib64/libc.so.6 (unknown line)
unknown function (ip: 0x4010b8)
Allocations: 5143867 (Pool: 5143009; Big: 858); GC: 10
/scratch/1695121957.576635.shell: line 13:  4389 Segmentation fault      (core dumped) julia wexac_utils/env_setup.jl

To reproduce

To reproduce the above error, I run `using CUDA` or

```julia
try
    using CUDA
catch
end
```

which doesn't help.
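A `try`/`catch` cannot help here because a segmentation fault is a signal that terminates the whole process, not a Julia exception. One possible workaround, not something suggested in this thread, is to probe the import in a child process first. The sketch below assumes a hypothetical helper `runs_cleanly` and a hypothetical constant `CUDA_LOADS`; only `Base.julia_cmd`, `pipeline`, and `success` are real Base functions.

```julia
# Hypothetical helper (not part of Julia or CUDA.jl): run `code` in a fresh
# Julia process and report whether that process exited normally.
function runs_cleanly(code::AbstractString)
    # Base.julia_cmd() reuses the current Julia executable and system image.
    # It does not forward a `--project` flag given to the parent, so set
    # JULIA_PROJECT if the package lives in a project environment.
    cmd = `$(Base.julia_cmd()) --startup-file=no -e $code`
    # success() is false for a non-zero exit code and for death by a signal
    # such as SIGSEGV, so a crash in the child cannot take down this process.
    return success(pipeline(cmd; stdout = devnull, stderr = devnull))
end

# Probe once at startup; only import CUDA here if the probe survived.
const CUDA_LOADS = runs_cleanly("using CUDA")

if CUDA_LOADS
    using CUDA
end
```

The probe costs one extra Julia startup and package load, so it is best run once per job rather than per call.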

Expected behavior

Hopefully, importing CUDA would not crash. For now, not knowing a better solution, I use a global flag to decide whether to import CUDA.
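The global-flag workaround could look roughly like the sketch below. The environment variable `USE_CUDA` and the helpers `gpu_available` and `to_device` are assumptions made for illustration, not the actual code from this project; `CUDA.functional()` and `CuArray` are the real CUDA.jl names used in its conditional-use documentation.

```julia
# Hypothetical opt-in flag: CUDA is only imported when the job sets
# USE_CUDA=true, e.g. in the submission script for GPU nodes.
const USE_CUDA = get(ENV, "USE_CUDA", "false") == "true"

if USE_CUDA
    using CUDA
end

# CUDA.functional() reports whether a usable GPU stack was found.
# The short-circuit keeps it from being called when CUDA was never imported.
gpu_available() = USE_CUDA && CUDA.functional()

# Move data to the GPU when possible, otherwise leave it on the CPU.
to_device(x::AbstractArray) = gpu_available() ? CuArray(x) : x
```

With the flag unset, `to_device` returns its argument unchanged, so the rest of the code can call it unconditionally.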

Manifest.toml

```
[[deps.CUDA]]
deps = ["AbstractFFTs", "Adapt", "BFloat16s", "CEnum", "CUDA_Driver_jll", "CUDA_Runtime_Discovery", "CUDA_Runtime_jll", "CompilerSupportLibraries_jll", "ExprTools", "GPUArrays", "GPUCompiler", "LLVM", "LazyArtifacts", "Libdl", "LinearAlgebra", "Logging", "Preferences", "Printf", "Random", "Random123", "RandomNumbers", "Reexport", "Requires", "SparseArrays", "SpecialFunctions"]
git-tree-sha1 = "edff14c60784c8f7191a62a23b15a421185bc8a8"
uuid = "052768ef-5323-5732-b1bb-66c8b64840ba"
version = "4.0.1"

[[deps.GPUArrays]]
deps = ["Adapt", "GPUArraysCore", "LLVM", "LinearAlgebra", "Printf", "Random", "Reexport", "Serialization", "Statistics"]
git-tree-sha1 = "2e57b4a4f9cc15e85a24d603256fe08e527f48d1"
uuid = "0c68f7d7-f131-5f86-a1c3-88cf8149b2d7"
version = "8.8.1"

[[deps.GPUArraysCore]]
deps = ["Adapt"]
git-tree-sha1 = "2d6ca471a6c7b536127afccfa7564b5b39227fe0"
uuid = "46192b85-c4d5-4398-a991-12ede77f4527"
version = "0.1.5"

[[deps.GPUCompiler]]
deps = ["ExprTools", "InteractiveUtils", "LLVM", "Libdl", "Logging", "TimerOutputs", "UUIDs"]
git-tree-sha1 = "19d693666a304e8c371798f4900f7435558c7cde"
uuid = "61eb1bfa-7361-4325-ad38-22787b887f55"
version = "0.17.3"

[[deps.LLVM]]
deps = ["CEnum", "LLVMExtra_jll", "Libdl", "Printf", "Unicode"]
git-tree-sha1 = "f044a2796a9e18e0531b9b3072b0019a61f264bc"
uuid = "929cbde3-209d-540e-8aea-75f648917ca0"
version = "4.17.1"

[[deps.LLVMExtra_jll]]
deps = ["Artifacts", "JLLWrappers", "LazyArtifacts", "Libdl", "TOML"]
git-tree-sha1 = "070e4b5b65827f82c16ae0916376cb47377aa1b5"
uuid = "dad2f222-ce93-54a1-a47d-0025e8a3acab"
version = "0.0.18+0"

[[deps.LLVMOpenMP_jll]]
deps = ["Artifacts", "JLLWrappers", "Libdl", "Pkg"]
git-tree-sha1 = "f689897ccbe049adb19a065c495e75f372ecd42b"
uuid = "1d63c593-3942-5779-bab2-d838dc0a180e"
version = "15.0.4+0"
```

Version info

Details on Julia:

Julia Version 1.9.3
Commit bed2cd540a1 (2023-08-24 14:43 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 128 × AMD EPYC 7702 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, znver2)
  Threads: 1 on 128 virtual cores
Environment:
  JULIA_DEPOT_PATH = :
  LD_LIBRARY_PATH = /apps/easybd/easybuild/software/Julia/1.9.3-linux-x86_64/lib:/usr/share/lsf/10.1/linux3.10-glibc2.17-x86_64/lib:/home/labs/testing/almoga/tmp/ncbi-magicblast-1.4.0-src/c++/local/ncbi-vdb-2.9.0-1/lib64

Details on CUDA:

CUDA runtime 11.8, artifact installation
CUDA driver 12.0
NVIDIA driver 525.125.6

Libraries: 
- CUBLAS: 11.11.3
- CURAND: 10.3.0
- CUFFT: 10.9.0
- CUSOLVER: 11.4.1
- CUSPARSE: 11.7.5
- CUPTI: 18.0.0
- NVML: 12.0.0+525.125.6

Toolchain:
- Julia: 1.9.3
- LLVM: 14.0.6
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0, 7.1, 7.2, 7.3, 7.4, 7.5
- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80, sm_86

1 device:
  0: Quadro RTX 8000 (sm_75, 47.449 GiB / 48.000 GiB available)

Additional context

This might be related to #1465.

maleadt commented 1 year ago

> This might be related to #1465.

That seems unlikely; why do you think so?

I'd rather suspect https://github.com/JuliaGPU/CUDA.jl/issues/1798 or something similar. In any case, you are using an old version of CUDA.jl, so please test again with v4.4 or v5.

yuvalwas commented 1 year ago

> That seems unlikely; why do you think so?

My comment was mainly based on ignorance; I just came across that issue while looking for related ones.

> In any case, you are using an old version of CUDA.jl, so please test again with v4.4 or v5

I'm having trouble updating; perhaps you know how to help? At first, when I tried to update CUDA, I got

pkg> update CUDA
    Updating registry at `C:\Users\yuvalw.WISMAIN\.julia\registries\General.toml`
ERROR: Unsatisfiable requirements detected for package GR_jll [d2c73de3]:
 GR_jll [d2c73de3] log:
 ├─possible versions are: 0.51.2-0.72.9 or uninstalled
 ├─restricted to versions 0.72.9 by an explicit requirement, leaving only versions: 0.72.9
 └─restricted by compatibility requirements with Qt6Base_jll [c0090381] to versions: 0.51.2-0.72.8 or uninstalled — no versions left
   └─Qt6Base_jll [c0090381] log:
     ├─possible versions are: 6.0.3-6.5.2 or uninstalled
     └─restricted to versions 6.5.2 by an explicit requirement, leaving only versions: 6.5.2

For some reason this doesn't show up anymore, but CUDA still won't update.

When I try to be more specific,

(Clusterless) pkg> add CUDA@4.4
   Resolving package versions...
ERROR: Unsatisfiable requirements detected for package KernelAbstractions [63c18a36]:
 KernelAbstractions [63c18a36] log:
 ├─possible versions are: 0.1.0-0.9.8 or uninstalled
 ├─restricted to versions * by Clusterless [26ac66cf], leaving only versions: 0.1.0-0.9.8
 │ └─Clusterless [26ac66cf] log:
 │   ├─possible versions are: 0.1.0 or uninstalled
 │   └─Clusterless [26ac66cf] is fixed to version 0.1.0
 ├─restricted by compatibility requirements with CUDA [052768ef] to versions: 0.9.2-0.9.8
 │ └─CUDA [052768ef] log:
 │   ├─possible versions are: 0.1.0-5.0.0 or uninstalled
 │   ├─restricted to versions * by Clusterless [26ac66cf], leaving only versions: 0.1.0-5.0.0
 │   │ └─Clusterless [26ac66cf] log: see above
 │   └─restricted to versions 4.4 by an explicit requirement, leaving only versions: 4.4.0-4.4.1
 └─restricted by compatibility requirements with CUDAKernels [72cfdca4] to versions: 0.8.0-0.8.6 — no versions left
   └─CUDAKernels [72cfdca4] log:
     ├─possible versions are: 0.1.0-0.4.7 or uninstalled
     ├─restricted to versions * by Clusterless [26ac66cf], leaving only versions: 0.1.0-0.4.7
     │ └─Clusterless [26ac66cf] log: see above
     └─restricted by compatibility requirements with CUDA [052768ef] to versions: 0.4.5-0.4.7 or uninstalled, leaving only versions: 0.4.5-0.4.7       
       └─CUDA [052768ef] log: see above

(Clusterless) pkg> add CUDA@5
   Resolving package versions...
ERROR: Unsatisfiable requirements detected for package CUDAKernels [72cfdca4]:
 CUDAKernels [72cfdca4] log:
 ├─possible versions are: 0.1.0-0.4.7 or uninstalled
 ├─restricted to versions * by Clusterless [26ac66cf], leaving only versions: 0.1.0-0.4.7
 │ └─Clusterless [26ac66cf] log:
 │   ├─possible versions are: 0.1.0 or uninstalled
 │   └─Clusterless [26ac66cf] is fixed to version 0.1.0
 └─restricted by compatibility requirements with CUDA [052768ef] to versions: uninstalled — no versions left
   └─CUDA [052768ef] log:
     ├─possible versions are: 0.1.0-5.0.0 or uninstalled
     ├─restricted to versions * by Clusterless [26ac66cf], leaving only versions: 0.1.0-5.0.0
     │ └─Clusterless [26ac66cf] log: see above
     └─restricted to versions 5 by an explicit requirement, leaving only versions: 5.0.0

Thank you!

maleadt commented 1 year ago

You could just try in a temporary, empty environment by using `]activate --temp`. In there, there should be no problem installing the latest CUDA.jl.

yuvalwas commented 1 year ago

You are right, there is no problem in a new environment with only CUDA v5.

maleadt commented 1 year ago

OK, I'm going to assume that this is the same issue as https://github.com/JuliaGPU/CUDA.jl/issues/1798 then, which is fixed in v4.4 and v5.

To upgrade your environment to CUDA.jl v4.4, I think you need to get rid of the CUDAKernels dependency, as that is now provided by CUDA.jl (`using CUDA.CUDAKernels`).
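As a rough sketch of that upgrade path, the steps could be scripted with the Pkg API as below. The helper name `upgrade_cuda_without_cudakernels` and its arguments are hypothetical; `Pkg.activate`, `Pkg.rm`, `Pkg.add`, and `Pkg.status` are the standard Pkg functions. It assumes CUDAKernels is a direct dependency of the project, since `Pkg.rm` fails otherwise.

```julia
using Pkg

# Hypothetical helper, not part of Pkg or CUDA.jl.
function upgrade_cuda_without_cudakernels(project_dir::AbstractString;
                                          cuda_version::AbstractString = "4.4")
    # Work on the project whose resolver errors are shown above.
    Pkg.activate(project_dir)
    # Remove the standalone CUDAKernels package, which is what blocks the
    # resolver; per the comment above, CUDA.jl now provides it itself.
    Pkg.rm("CUDAKernels")
    # Request the newer CUDA.jl release explicitly.
    Pkg.add(name = "CUDA", version = cuda_version)
    # Print the resolved versions for a quick sanity check.
    Pkg.status()
end
```

Calling `upgrade_cuda_without_cudakernels(".")` from the project directory would apply the change; any `using CUDAKernels` in the code then becomes `using CUDA.CUDAKernels`, as suggested above.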

yuvalwas commented 1 year ago

Yes, I just reached the same conclusion: the problem is in CUDAKernels. The only reason I have it installed is that KernelAbstractions and CUDAKernels are supposed to be loaded (if I understood correctly) for Tullio to use them. I'll remove it. Thank you for your help.