Open sakura-nyaa opened 1 week ago
What OS are you on? Is this an official build of ROCm?
I'd take a look at the output of libtree on libMIOpen.so:
pxl-th@Tower:~$ libtree /opt/rocm-6.2.2/lib/libMIOpen.so
libMIOpen.so.1
├── libhiprtc.so.6 [runpath]
│ └── libnuma.so.1 [ld.so.conf]
├── libamdhip64.so.6 [runpath]
│ ├── librocprofiler-register.so.0 [runpath]
│ ├── libamd_comgr.so.2 [runpath]
│ │ ├── libz.so.1 [ld.so.conf]
│ │ ├── libtinfo.so.6 [ld.so.conf]
│ │ └── libzstd.so.1 [ld.so.conf]
│ ├── libhsa-runtime64.so.1 [runpath]
│ │ ├── librocprofiler-register.so.0 [runpath]
│ │ ├── libdrm.so.2 [ld.so.conf]
│ │ ├── libdrm_amdgpu.so.1 [ld.so.conf]
│ │ │ └── libdrm.so.2 [ld.so.conf]
│ │ ├── libelf.so.1 [ld.so.conf]
│ │ │ ├── libz.so.1 [ld.so.conf]
│ │ │ └── libzstd.so.1 [ld.so.conf]
│ │ └── libnuma.so.1 [ld.so.conf]
│ └── libnuma.so.1 [ld.so.conf]
├── libroctx64.so.4 [runpath]
├── librocblas.so.4 [runpath]
│ └── libamdhip64.so.6 [runpath]
├── librocm-core.so.1 [runpath]
├── libamd_comgr.so.2 [runpath]
└── libzstd.so.1 [ld.so.conf]
I run into the same problem on Arch Linux. My setup worked previously, but I think it stopped working after a ROCm update.
When I tried libtree, I noticed libmiopen was not actually installed. Maybe the ROCm packages were split up and a dependency is missing. Installing miopen did not fix the issue, but it gives this libtree output:
libMIOpen.so.1
├── libhiprtc.so.6 [runpath]
├── libamdhip64.so.6 [runpath]
│ ├── librocprofiler-register.so.0 [runpath]
│ │ ├── libfmt.so.11 [default path]
│ │ └── libglog.so.2 [default path]
│ │ └── libgflags.so.2.2 [default path]
│ ├── libamd_comgr.so.2 [runpath]
│ │ ├── libz.so.1 [default path]
│ │ ├── libncursesw.so.6 [default path]
│ │ └── libzstd.so.1 [default path]
│ ├── libhsa-runtime64.so.1 [runpath]
│ │ ├── libhsakmt.so.1 [ld.so.conf]
│ │ │ ├── libdrm.so.2 [default path]
│ │ │ ├── libnuma.so.1 [default path]
│ │ │ └── libdrm_amdgpu.so.1 [default path]
│ │ │ └── libdrm.so.2 [default path]
│ │ ├── libelf.so.1 [default path]
│ │ │ ├── libz.so.1 [default path]
│ │ │ └── libzstd.so.1 [default path]
│ │ └── libdrm.so.2 [default path]
│ └── libnuma.so.1 [default path]
├── libroctx64.so.4 [runpath]
├── libamd_comgr.so.2 [runpath]
├── librocblas.so.4 [runpath]
│ └── libamdhip64.so.6 [runpath]
├── libbz2.so.1.0 [default path]
└── libsqlite3.so.0 [default path]
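To check whether the library files are present at all without loading them, one option is to scan the usual install directories. This is only a minimal sketch: the helper name `find_on_disk` and the two directories are assumptions for illustration, so adjust them for your distribution.

```julia
# Hypothetical helper: list files in `dirs` whose names start with `prefix`.
# It only reads directory listings and never loads the libraries.
function find_on_disk(prefix::AbstractString, dirs)
    hits = String[]
    for dir in dirs
        isdir(dir) || continue          # skip directories that do not exist
        for file in readdir(dir)
            startswith(file, prefix) && push!(hits, joinpath(dir, file))
        end
    end
    return hits
end

# Assumed install locations; adjust for your system.
candidate_dirs = ["/opt/rocm/lib", "/usr/lib"]

for prefix in ("libMIOpen", "librocblas", "libhsa-runtime64")
    hits = find_on_disk(prefix, candidate_dirs)
    println(prefix, ": ", isempty(hits) ? "not found" : join(hits, ", "))
end
```

Because nothing is opened with dlopen here, the check should be safe to run even on a setup where loading the libraries aborts the process.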
Hi, @laochailan. Can you try moving:
global libMIOpen_path = get_library(lib_prefix * "MIOpen"; rocm_path)
before the lines:
global libhsaruntime = if Sys.islinux()
    get_library("libhsa-runtime64"; rocm_path, ext="so.1")
else
    ""
end
in the src/discovery/discovery.jl file and see if it also helps you?
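For reference, the suggested order can be tried outside the package with a rough, self-contained sketch. The `get_library` below is a hypothetical stand-in that only copies the call shape from the lines quoted above; the package's real implementation may differ. The `lib_prefix` definition and the `ROCM_PATH` fallback to `/opt/rocm` are likewise assumptions. Note that `Libdl.find_library` opens a library to test for it, so the order of the statements is also the order in which the dynamic loader sees the libraries.

```julia
using Libdl

# Hypothetical stand-in for the package's `get_library` (assumption: the real
# one may behave differently). Returns the library path/name, or "" if the
# loader could not open it.
function get_library(name::AbstractString; rocm_path::AbstractString,
                     ext::AbstractString = Libdl.dlext)
    # `find_library` briefly dlopens each candidate, then closes it again.
    return Libdl.find_library(name * "." * ext, [joinpath(rocm_path, "lib")])
end

# Assumptions for this sketch, not taken from the package source.
const lib_prefix = Sys.islinux() ? "lib" : ""
rocm_path = get(ENV, "ROCM_PATH", "/opt/rocm")

# Suggested order: MIOpen is looked up before the HSA runtime.
libMIOpen_path = get_library(lib_prefix * "MIOpen"; rocm_path)
libhsaruntime = if Sys.islinux()
    get_library("libhsa-runtime64"; rocm_path, ext="so.1")
else
    ""
end

println("MIOpen:      ", isempty(libMIOpen_path) ? "(not found)" : libMIOpen_path)
println("HSA runtime: ", isempty(libhsaruntime) ? "(not found)" : libhsaruntime)
```

Swapping the two lookups and re-running in a fresh Julia session is one way to see whether the order alone changes the behaviour on a given machine.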
Also on Arch and also having the same issue. Moving the libMIOpen_path line doesn't seem to fix it.
Update: moving the discovery of all libraries (rocblas, rocfft, rocsolver, etc.) before the hsaruntime one does the trick. Not sure what changed. I don't know what effect this might have on other platforms, but if you don't think it affects anything, I can submit a PR.
EDIT: while I can allocate arrays on the GPU, even trying to multiply gives this core dump:
julia: /usr/src/debug/hip-runtime/clr-rocm-6.2.2/hipamd/src/hip_code_object.cpp:1152: hip::FatBinaryInfo** hip::StatCO::addFatBinary(const void*, bool): Assertion `err == hipSuccess' failed.
[323653] signal 6 (-6): Aborted
in expression starting at REPL[3]:1
unknown function (ip: 0x7e1f4f62d3f4)
gsignal at /usr/bin/../lib/libc.so.6 (unknown line)
abort at /usr/bin/../lib/libc.so.6 (unknown line)
unknown function (ip: 0x7e1f4f5bb3de)
__assert_fail at /usr/bin/../lib/libc.so.6 (unknown line)
unknown function (ip: 0x7e1ef6a50954)
unknown function (ip: 0x7e1e766ec8a8)
unknown function (ip: 0x7e1f4f79e5b6)
unknown function (ip: 0x7e1f4f79e6ac)
_dl_catch_exception at /lib64/ld-linux-x86-64.so.2 (unknown line)
unknown function (ip: 0x7e1f4f7a54fb)
_dl_catch_exception at /lib64/ld-linux-x86-64.so.2 (unknown line)
unknown function (ip: 0x7e1f4f7a5903)
unknown function (ip: 0x7e1f4f626f13)
_dl_catch_exception at /lib64/ld-linux-x86-64.so.2 (unknown line)
unknown function (ip: 0x7e1f4f79b678)
unknown function (ip: 0x7e1f4f6269f2)
dlopen at /usr/bin/../lib/libc.so.6 (unknown line)
ijl_load_dynamic_library at /cache/build/builder-demeter6-6/julialang/julia-master/src/dlload.c:365
jl_get_library_ at /cache/build/builder-demeter6-6/julialang/julia-master/src/runtime_ccall.cpp:45 [inlined]
jl_get_library_ at /cache/build/builder-demeter6-6/julialang/julia-master/src/runtime_ccall.cpp:29
ijl_lazy_load_and_lookup at /cache/build/builder-demeter6-6/julialang/julia-master/src/runtime_ccall.cpp:73
macro expansion at /home/fra/.julia/packages/AMDGPU/yqCEl/src/utils.jl:134 [inlined]
rocblas_create_handle at /home/fra/.julia/packages/AMDGPU/yqCEl/src/blas/librocblas.jl:230
macro expansion at /home/fra/.julia/packages/AMDGPU/yqCEl/src/utils.jl:134 [inlined]
create_handle at /home/fra/.julia/packages/AMDGPU/yqCEl/src/blas/rocBLAS.jl:36 [inlined]
#14 at /home/fra/.julia/packages/AMDGPU/yqCEl/src/cache.jl:103 [inlined]
#5 at /home/fra/.julia/packages/AMDGPU/yqCEl/src/cache.jl:29
lock at ./lock.jl:232
check_cache at /home/fra/.julia/packages/AMDGPU/yqCEl/src/cache.jl:27 [inlined]
pop! at /home/fra/.julia/packages/AMDGPU/yqCEl/src/cache.jl:48 [inlined]
new_state at /home/fra/.julia/packages/AMDGPU/yqCEl/src/cache.jl:102
#18 at /home/fra/.julia/packages/AMDGPU/yqCEl/src/cache.jl:115 [inlined]
get! at ./dict.jl:458
library_state at /home/fra/.julia/packages/AMDGPU/yqCEl/src/cache.jl:115
lib_state at /home/fra/.julia/packages/AMDGPU/yqCEl/src/blas/rocBLAS.jl:48 [inlined]
gemm! at /home/fra/.julia/packages/AMDGPU/yqCEl/src/blas/wrappers.jl:562 [inlined]
generic_matmatmul! at /home/fra/.julia/packages/AMDGPU/yqCEl/src/blas/highlevel.jl:178
generic_matmatmul! at /home/fra/.julia/packages/AMDGPU/yqCEl/src/blas/highlevel.jl:148 [inlined]
_mul! at /cache/build/builder-demeter6-6/julialang/julia-master/usr/share/julia/stdlib/v1.11/LinearAlgebra/src/matmul.jl:287 [inlined]
mul! at /cache/build/builder-demeter6-6/julialang/julia-master/usr/share/julia/stdlib/v1.11/LinearAlgebra/src/matmul.jl:285 [inlined]
mul! at /cache/build/builder-demeter6-6/julialang/julia-master/usr/share/julia/stdlib/v1.11/LinearAlgebra/src/matmul.jl:253 [inlined]
* at /cache/build/builder-demeter6-6/julialang/julia-master/usr/share/julia/stdlib/v1.11/LinearAlgebra/src/matmul.jl:124
unknown function (ip: 0x7e1f42f27da6)
jl_apply at /cache/build/builder-demeter6-6/julialang/julia-master/src/julia.h:2157 [inlined]
do_call at /cache/build/builder-demeter6-6/julialang/julia-master/src/interpreter.c:126
eval_value at /cache/build/builder-demeter6-6/julialang/julia-master/src/interpreter.c:223
eval_stmt_value at /cache/build/builder-demeter6-6/julialang/julia-master/src/interpreter.c:174 [inlined]
eval_body at /cache/build/builder-demeter6-6/julialang/julia-master/src/interpreter.c:663
jl_interpret_toplevel_thunk at /cache/build/builder-demeter6-6/julialang/julia-master/src/interpreter.c:821
jl_toplevel_eval_flex at /cache/build/builder-demeter6-6/julialang/julia-master/src/toplevel.c:943
jl_toplevel_eval_flex at /cache/build/builder-demeter6-6/julialang/julia-master/src/toplevel.c:886
ijl_toplevel_eval_in at /cache/build/builder-demeter6-6/julialang/julia-master/src/toplevel.c:994
eval at ./boot.jl:430 [inlined]
eval_user_input at /cache/build/builder-demeter6-6/julialang/julia-master/usr/share/julia/stdlib/v1.11/REPL/src/REPL.jl:245
repl_backend_loop at /cache/build/builder-demeter6-6/julialang/julia-master/usr/share/julia/stdlib/v1.11/REPL/src/REPL.jl:342
#start_repl_backend#59 at /cache/build/builder-demeter6-6/julialang/julia-master/usr/share/julia/stdlib/v1.11/REPL/src/REPL.jl:327
start_repl_backend at /cache/build/builder-demeter6-6/julialang/julia-master/usr/share/julia/stdlib/v1.11/REPL/src/REPL.jl:324
#run_repl#72 at /cache/build/builder-demeter6-6/julialang/julia-master/usr/share/julia/stdlib/v1.11/REPL/src/REPL.jl:483
run_repl at /cache/build/builder-demeter6-6/julialang/julia-master/usr/share/julia/stdlib/v1.11/REPL/src/REPL.jl:469
jfptr_run_repl_10088 at /usr/share/julia/compiled/v1.11/REPL/u0gqU_GYsA8.so (unknown line)
#1139 at ./client.jl:446
jfptr_YY.1139_14649 at /usr/share/julia/compiled/v1.11/REPL/u0gqU_GYsA8.so (unknown line)
jl_apply at /cache/build/builder-demeter6-6/julialang/julia-master/src/julia.h:2157 [inlined]
jl_f__call_latest at /cache/build/builder-demeter6-6/julialang/julia-master/src/builtins.c:875
#invokelatest#2 at ./essentials.jl:1055 [inlined]
invokelatest at ./essentials.jl:1052 [inlined]
run_main_repl at ./client.jl:430
repl_main at ./client.jl:567 [inlined]
_start at ./client.jl:541
jfptr__start_72144.1 at /usr/lib/julia/sys.so (unknown line)
jl_apply at /cache/build/builder-demeter6-6/julialang/julia-master/src/julia.h:2157 [inlined]
true_main at /cache/build/builder-demeter6-6/julialang/julia-master/src/jlapi.c:900
jl_repl_entrypoint at /cache/build/builder-demeter6-6/julialang/julia-master/src/jlapi.c:1059
main at julia (unknown line)
unknown function (ip: 0x7e1f4f5bce07)
__libc_start_main at /usr/bin/../lib/libc.so.6 (unknown line)
unknown function (ip: 0x4010b8)
Allocations: 6981943 (Pool: 6981676; Big: 267); GC: 9
zsh: IOT instruction (core dumped) julia
It seems to be the rocblas call that is giving issues. If I do elementwise multiplication it works. However, upon calling exit(), I get a segfault. Definitely something fishy going on.
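One thing that may help narrow this down is checking which copies of the ROCm libraries the Julia process has actually loaded, since a mix of copies from different locations could produce this kind of failure. A minimal sketch using only the `Libdl` standard library; the helper name `loaded_matching` and the list of library names are assumptions for illustration:

```julia
using Libdl

# Hypothetical helper: paths of already-loaded shared libraries whose file
# name contains `needle`. It inspects the process and loads nothing new.
loaded_matching(needle::AbstractString) =
    filter(path -> occursin(needle, basename(path)), Libdl.dllist())

for name in ("hsa-runtime64", "amdhip64", "rocblas", "MIOpen")
    paths = loaded_matching(name)
    println(rpad(name, 16), isempty(paths) ? "(not loaded)" : join(paths, ", "))
end
```

Run it in the same session after `using AMDGPU` (and before the failing call) to see which paths the libraries were resolved from.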
Whatever it is, downgrading ROCm to 6.0.2 solved it. I don't know if this is Arch-specific.
Hoping somebody who understands HIP/ROCm better than me can help me understand what's going on here. Using the version you get from "add AMDGPU", I get a core dump instantly. By going into src/discovery/discovery.jl and moving
up to the top (it needs to come before libhsa gets loaded; one line lower and the core dumps return):
the core dumps stop and everything seems to work normally. Anybody have any ideas? Thanks for any help.
commit:
AMDGPU.versioninfo()
GDB backtrace:
rocminfo: