JuliaGPU / AMDGPU.jl

AMD GPU (ROCm) programming in Julia

Segfault in libamdhip #216

Open Keno opened 2 years ago

Keno commented 2 years ago
julia> a_d = ROCArray(a)
32-element ROCVector{Float64}:
free(): invalid pointer

signal (6): Aborted
in expression starting at none:0
gsignal at /usr/lib/libc.so.6 (unknown line)
abort at /usr/lib/libc.so.6 (unknown line)
__libc_message at /usr/lib/libc.so.6 (unknown line)
malloc_printerr at /usr/lib/libc.so.6 (unknown line)
_int_free at /usr/lib/libc.so.6 (unknown line)
cfree at /usr/lib/libc.so.6 (unknown line)
unknown function (ip: 0x7fb9316ce317)
unknown function (ip: 0x7fb9316cf7e7)
unknown function (ip: 0x7fb93167e30e)
unknown function (ip: 0x7fb93169426d)
unknown function (ip: 0x7fb93157c834)
__pthread_once_slow at /usr/lib/libpthread.so.0 (unknown line)
hipStreamSynchronize at /home/deck/.julia/artifacts/b5a35fe56035e3d95e3203689c38aafec324a861/hip/lib/libamdhip64.so (unknown line)
macro expansion at /home/deck/.julia/packages/AMDGPU/PtNLZ/src/hip/error.jl:149 [inlined]
hipStreamSynchronize at /home/deck/.julia/packages/AMDGPU/PtNLZ/src/hip/libhip.jl:2
wait! at /home/deck/.julia/packages/AMDGPU/PtNLZ/src/sync.jl:20
wait! at /home/deck/.julia/packages/AMDGPU/PtNLZ/src/array.jl:86 [inlined]
copyto! at /home/deck/.julia/packages/AMDGPU/PtNLZ/src/array.jl:182
copyto! at /home/deck/.julia/packages/GPUArrays/VNhDf/src/host/abstractarray.jl:95 [inlined]
copyto_axcheck! at ./abstractarray.jl:1104 [inlined]
Array at ./array.jl:563 [inlined]
Array at ./boot.jl:481 [inlined]
convert at ./array.jl:554 [inlined]
adapt_storage at /home/deck/.julia/packages/GPUArrays/VNhDf/src/host/abstractarray.jl:45 [inlined]
adapt_structure at /home/deck/.julia/packages/Adapt/wASZA/src/Adapt.jl:42 [inlined]
adapt at /home/deck/.julia/packages/Adapt/wASZA/src/Adapt.jl:40 [inlined]
print_array at /home/deck/.julia/packages/GPUArrays/VNhDf/src/host/abstractarray.jl:48 [inlined]
show at ./arrayshow.jl:396
unknown function (ip: 0x7fb9326da581)
julia> versioninfo()
Julia Version 1.7.2
Commit bf53498635 (2022-02-06 15:21 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD Custom APU 0405
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, znver2)

jpsamaroo commented 2 years ago

> AMD Custom APU 0405

Is this a special/experimental APU? In the past, we've had bugs and segfaults with APUs (including on my own).

Are you using AMDGPU-provided ROCm artifacts, or system libraries?

Keno commented 2 years ago

> Is this a special/experimental APU?

No, this is an AMD Van Gogh APU.

> Are you using AMDGPU-provided ROCm artifacts, or system libraries?

AMDGPU-provided

jpsamaroo commented 2 years ago

Can you try disabling artifacts with JULIA_AMDGPU_DISABLE_ARTIFACTS=1 and re-building AMDGPU? Assuming you have a system-provided ROCm available.
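
One way to try this from a shell is sketched below. It assumes `julia` is on the PATH, that AMDGPU is installed in the active environment, and that the variable is picked up when the package's build step runs (the thread only says to set it and rebuild). `rebuild_amdgpu_system_rocm` is a hypothetical helper name, not something AMDGPU provides.

```shell
# Sketch only: assumes `julia` is on PATH and a system ROCm install is present.
# rebuild_amdgpu_system_rocm is a hypothetical helper name.
rebuild_amdgpu_system_rocm() {
    # Set the variable for this one build invocation only,
    # so the rest of the shell session is left unchanged.
    JULIA_AMDGPU_DISABLE_ARTIFACTS=1 julia -e 'using Pkg; Pkg.build("AMDGPU")'
}

# Usage:
#   rebuild_amdgpu_system_rocm
```

After the rebuild, the backtrace should show which library was loaded: a path under `/opt/rocm` rather than under `~/.julia/artifacts` would suggest the system ROCm was used.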

Keno commented 2 years ago

> Can you try disabling artifacts with JULIA_AMDGPU_DISABLE_ARTIFACTS=1 and re-building AMDGPU? Assuming you have a system-provided ROCm available.

Segfaults also, similar backtrace:

signal (11): Segmentation fault
in expression starting at none:0
unknown function (ip: 0x7f09e0f1e0fd)
unknown function (ip: 0x7f09e0f1e3b7)
hipStreamSynchronize at /opt/rocm/lib/libamdhip64.so (unknown line)
macro expansion at /home/deck/.julia/packages/AMDGPU/PtNLZ/src/hip/error.jl:149 [inlined]
hipStreamSynchronize at /home/deck/.julia/packages/AMDGPU/PtNLZ/src/hip/libhip.jl:2

jpsamaroo commented 2 years ago

So, if you just want to hide libamdhip64.so from AMDGPU (rename it to .bak or similar), AMDGPU can load without it. You may also need to do the same for rocBLAS, rocFFT, et al.
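
The renaming could be done along these lines. This is a minimal sketch: `hide_lib` and `restore_lib` are hypothetical helper names, the example path is the system library location that appears in the backtrace earlier in the thread, and whether renaming only the unversioned `.so` name is enough depends on how AMDGPU locates the library, which is not spelled out here.

```shell
# Hypothetical helpers: rename a shared library so it is no longer found
# under its usual name, and undo that rename later.
hide_lib() {
    # $1: path to the library, e.g. /opt/rocm/lib/libamdhip64.so
    [ -e "$1" ] || { echo "not found: $1" >&2; return 1; }
    mv -- "$1" "$1.bak"
}

restore_lib() {
    # $1: original path of the library (without the .bak suffix)
    [ -e "$1.bak" ] || { echo "no backup for: $1" >&2; return 1; }
    mv -- "$1.bak" "$1"
}

# Example (modifying files under /opt/rocm will likely need root):
#   hide_lib /opt/rocm/lib/libamdhip64.so
#   restore_lib /opt/rocm/lib/libamdhip64.so
```

Keeping a matching restore step makes it easy to put the library back once debugging is done.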

If you actually want full functionality, then building glibc with debug symbols would be very helpful.

Keno commented 2 years ago

Is it actually in glibc though? Presumably __pthread_once_slow calls back into whatever callback HIP passes it. I tried building HIP with debug symbols, but ran into https://github.com/JuliaPackaging/Yggdrasil/pull/4689#issuecomment-1081262980