syamajala closed this issue 11 months ago
The best option for debugging this is a minimized reproducer.
Barring that, we could at least try to get a debug ROCm, but that'll be slow both to obtain and to interpret results from.
@elliottslaughter is there a debug ROCm? I thought ROCm was delivered as a binary library.
ROCm is open source. You can build it with Spack, and I did so a while back (when I was waiting for ROCm 5.1 to be deployed to the machine). It's finicky, and hideously slow to build, but doable.
Whether Spack lets you do a debug build, I'm not sure, since I didn't try to do that.
But again, this is definitely plan B, and by a wide margin.
I pulled out two of the tasks from S3D that I've seen this crash occur on, CalcVolumeTask and CalcSpeciesTask, but they seem to work in a standalone example doing a subrank launch. The only other thing I can think of that might make a difference is that the standalone example isn't using separate compilation.
I was able to reproduce this using separate compilation with subranks.
There is a standalone example available here on Crusher: /gpfs/alpine/cmb103/world-shared/seshuy/subrank
Run it on an interactive node like this: srun --tasks-per-node 1 --gpus-per-task 8 --cpus-per-task 6 regent.py optimize_index_launch_nested_nodes_cores_4d.rg -ll:gpu 8 -ll:cpu 1 -ll:csize 16384 -ll:fsize 16384 -findex-launch-dynamic 0 -foverride-demand-index-launch 1 -fflow 0 -fseparate 1 -fincr-comp 1 -logfile run_%.log
I'm able to reproduce this.
Some things I've noticed so far:
- A lot of time is spent in fill calls. This can be addressed somewhat by reducing problem size. (Reproduction is not sensitive to problem size.)
- It takes -ll:gpu 2 (and var subranks = 2) to crash. A single GPU is not sufficient, even with the same launch pattern.
- CalcSpeciesTask is not required. I only need to launch CalcVolumeTask.

Here's a shorter reproducer that works for me:
repro.rg:
import "regent"
extern task gpu_task(r : region(int))
task main()
var r = region(ispace(ptr, 10), int)
var t = ispace(int1d, 10)
var p = partition(equal, r, t)
for i in t do
gpu_task(p[i])
end
end
local tmp_dir = './'
local root_dir = arg[0]:match(".*/") or "./"
local loaders = terralib.newlist()
local regent_exe = os.getenv('REGENT') or 'regent'
local tasks_rg = "repro_gpu_task.rg"
local tasks_h = "repro_gpu_task.h"
local tasks_so = tmp_dir .. "librepro_gpu_task.so"
if os.execute(regent_exe .. " " .. tmp_dir .. tasks_rg .. " -fseparate 1 -fgpu hip -fgpu-arch gfx90a -fincr-comp 1") ~= 0 then
print("Error: failed to compile " .. tmp_dir .. tasks_rg)
assert(false)
end
local tasks_c = terralib.includec(tasks_h, {"-I", tmp_dir})
loaders:insert(tasks_c["repro_gpu_task_h_register"])
terralib.linklibrary(tasks_so)
terra loader()
[loaders:map(function(thunk) return `thunk() end)]
end
regentlib.start(main, loader)
repro_gpu_task.rg:
import "regent"
__demand(__cuda)
task gpu_task(r : region(int))
for i in r do
end
end
local repro_gpu_task_h = "./repro_gpu_task.h"
local repro_gpu_task_so = "./librepro_gpu_task.so"
regentlib.save_tasks(repro_gpu_task_h, repro_gpu_task_so, nil, nil, nil, nil, false)
Command:
../regent.py repro.rg -ll:gpu 2 -ll:cpu 1 -ll:csize 1024 -ll:fsize 1024 -fflow 0 -fseparate 1 -fincr-comp 1
Output:
$ ../regent.py repro.rg -ll:gpu 2 -ll:cpu 1 -ll:csize 1024 -ll:fsize 1024 -fflow 0 -fseparate 1 -fincr-comp 1
/opt/rocm-4.5.0/llvm/bin/ld.lld -shared -plugin-opt=mcpu=gfx90a -plugin-opt=-amdgpu-internalize-symbols -plugin-opt=O3 -plugin-opt=-amdgpu-early-inline-all=true -plugin-opt=-amdgpu-function-calls=false -o /tmp/lua_NnKvrd /tmp/lua_sHHPld /opt/rocm-4.5.0/amdgcn/bitcode/oclc_finite_only_off.bc /opt/rocm-4.5.0/amdgcn/bitcode/ocml.bc /opt/rocm-4.5.0/amdgcn/bitcode/oclc_daz_opt_off.bc /opt/rocm-4.5.0/amdgcn/bitcode/oclc_wavefrontsize64_on.bc /opt/rocm-4.5.0/amdgcn/bitcode/oclc_isa_version_90a.bc /opt/rocm-4.5.0/amdgcn/bitcode/oclc_unsafe_math_off.bc /opt/rocm-4.5.0/amdgcn/bitcode/oclc_correctly_rounded_sqrt_on.bc
/opt/rocm-4.5.0/llvm/bin/clang-offload-bundler --inputs=/dev/null,/tmp/lua_NnKvrd --type=o --outputs=/tmp/lua_JJsskf --targets=host-x86_64-unknown-linux-gnu,hipv4-amdgcn-amd-amdhsa--gfx90a
[0 - 7fffc17f6780] 0.116352 {4}{hip}: HIP hijack code not active - device synchronizations required after every GPU task!
/opt/rocm-4.5.0/lib/libamdhip64.so.4(+0x11032d) [0x7fffe8db932d]
/opt/rocm-4.5.0/lib/libamdhip64.so.4(+0x118b8d) [0x7fffe8dc1b8d]
/opt/rocm-4.5.0/lib/libamdhip64.so.4(hipModuleLaunchKernel+0x4fa) [0x7fffe8dc27ca]
./librepro_gpu_task.so(+0x73a3) [0x7fffdebb23a3]
I think this is just an issue with registering kernels on multiple GPUs.
Minimal reproducer:
import "regent"
__demand(__cuda)
task gpu_task(r : region(int))
for i in r do
end
end
task main()
var r = region(ispace(ptr, 10), int)
var t = ispace(int1d, 10)
var p = partition(equal, r, t)
for i in t do
gpu_task(p[i])
end
end
regentlib.start(main)
Command:
../regent.py repro.rg -ll:gpu 2 -ll:cpu 1 -ll:csize 1024 -ll:fsize 1024 -fflow 0 -fgpu hip -fgpu-arch gfx90a
Result:
$ ../regent.py repro.rg -ll:gpu 2 -ll:cpu 1 -ll:csize 1024 -ll:fsize 1024 -fflow 0 -fgpu hip -fgpu-arch gfx90a
/opt/rocm-4.5.0/llvm/bin/ld.lld -shared -plugin-opt=mcpu=gfx90a -plugin-opt=-amdgpu-internalize-symbols -plugin-opt=O3 -plugin-opt=-amdgpu-early-inline-all=true -plugin-opt=-amdgpu-function-calls=false -o /tmp/lua_rctCwo /tmp/lua_7MBS7l /opt/rocm-4.5.0/amdgcn/bitcode/oclc_finite_only_off.bc /opt/rocm-4.5.0/amdgcn/bitcode/ocml.bc /opt/rocm-4.5.0/amdgcn/bitcode/oclc_daz_opt_off.bc /opt/rocm-4.5.0/amdgcn/bitcode/oclc_wavefrontsize64_on.bc /opt/rocm-4.5.0/amdgcn/bitcode/oclc_isa_version_90a.bc /opt/rocm-4.5.0/amdgcn/bitcode/oclc_unsafe_math_off.bc /opt/rocm-4.5.0/amdgcn/bitcode/oclc_correctly_rounded_sqrt_on.bc
/opt/rocm-4.5.0/llvm/bin/clang-offload-bundler --inputs=/dev/null,/tmp/lua_rctCwo --type=o --outputs=/tmp/lua_JlhK6l --targets=host-x86_64-unknown-linux-gnu,hipv4-amdgcn-amd-amdhsa--gfx90a
[0 - 7fffc19f6780] 0.115502 {4}{hip}: HIP hijack code not active - device synchronizations required after every GPU task!
[0 - 7fffc19f6780] 0.117924 {4}{hip}: HIP hijack is active - device synchronizations not required after every GPU task!
/opt/rocm-4.5.0/lib/libamdhip64.so.4(+0x11032d) [0x7fffe8db932d]
/opt/rocm-4.5.0/lib/libamdhip64.so.4(+0x118b8d) [0x7fffe8dc1b8d]
/opt/rocm-4.5.0/lib/libamdhip64.so.4(hipModuleLaunchKernel+0x4fa) [0x7fffe8dc27ca]
3 terra (JIT) 0x00007fffed9f73d9 $<gpu_task>.64 + 1529
[0x7fffc17f1788]
[0x1]
No separate compilation required to reproduce.
Note to self: reproduce with C++ circuit next (with HIP).
Confirmed that this bug is still present in Regent Circuit with ROCm 5.4.3, 1 node, 2 GPUs. Backtrace:
(gdb) bt
#0 0x00007fffe87f4cc1 in clock_nanosleep@GLIBC_2.2.5 () from /lib64/libc.so.6
#1 0x00007fffe87fa9c3 in nanosleep () from /lib64/libc.so.6
#2 0x00007fffe87fa8da in sleep () from /lib64/libc.so.6
#3 0x00007fffebe031b6 in Realm::realm_freeze (signal=11)
at /autofs/nccs-svm1_home1/eslaught/crusher/legion/runtime/realm/runtime_impl.cc:200
#4 <signal handler called>
#5 0x00007fffe8cd75c2 in ?? () from /opt/rocm-5.4.3/lib/libamdhip64.so.5
#6 0x00007fffe8cd7c62 in ?? () from /opt/rocm-5.4.3/lib/libamdhip64.so.5
#7 0x00007fffe8ce763c in hipModuleLaunchKernel () from /opt/rocm-5.4.3/lib/libamdhip64.so.5
#8 0x0000000000415395 in $<calculate_new_currents> ()
#9 0x00000000004140d9 in $__regent_task_calculate_new_currents_cuda ()
#10 0x00007fffec227ef0 in Realm::Hip::GPUProcessor::execute_task (this=0x9ad040, func_id=19, task_args=...)
at /autofs/nccs-svm1_home1/eslaught/crusher/legion/runtime/realm/hip/hip_module.cc:2087
#11 0x00007fffebf7317a in Realm::Task::execute_on_processor (this=0x7ff16805e630, p=...)
at /autofs/nccs-svm1_home1/eslaught/crusher/legion/runtime/realm/tasks.cc:326
#12 0x00007fffebf77146 in Realm::KernelThreadTaskScheduler::execute_task (this=0xc8bf90, task=0x7ff16805e630)
at /autofs/nccs-svm1_home1/eslaught/crusher/legion/runtime/realm/tasks.cc:1421
#13 0x00007fffec254a68 in Realm::Hip::GPUTaskScheduler<Realm::KernelThreadTaskScheduler>::execute_task (this=0xc8bf90, task=0x7ff16805e630)
at /autofs/nccs-svm1_home1/eslaught/crusher/legion/runtime/realm/hip/hip_module.cc:1500
#14 0x00007fffebf75f85 in Realm::ThreadedTaskScheduler::scheduler_loop (this=0xc8bf90)
at /autofs/nccs-svm1_home1/eslaught/crusher/legion/runtime/realm/tasks.cc:1160
#15 0x00007fffebf765c2 in Realm::ThreadedTaskScheduler::scheduler_loop_wlock (this=0xc8bf90)
at /autofs/nccs-svm1_home1/eslaught/crusher/legion/runtime/realm/tasks.cc:1272
#16 0x00007fffebf7dca4 in Realm::Thread::thread_entry_wrapper<Realm::ThreadedTaskScheduler, &Realm::ThreadedTaskScheduler::scheduler_loop_wlock> (obj=0xc8bf90) at /autofs/nccs-svm1_home1/eslaught/crusher/legion/runtime/realm/threads.inl:97
#17 0x00007fffebf4d3d7 in Realm::KernelThread::pthread_entry (data=0xb01e70)
at /autofs/nccs-svm1_home1/eslaught/crusher/legion/runtime/realm/threads.cc:781
#18 0x00007fffe7cb66ea in start_thread () from /lib64/libpthread.so.0
#19 0x00007fffe8830a6f in clone () from /lib64/libc.so.6
Confirmed that multi-GPU works with C++ Circuit with ROCm 5.4.3. So this is probably a bug with either the hipModule* APIs (since hipcc does not use those) or with how Regent is using hipModule*.
Based on feedback from OLCF staff this week, I am testing the following patch for this:
https://gitlab.com/StanfordLegion/legion/-/merge_requests/903
Currently waiting on jobs to finish on Crusher to see if it's working or not.
I fixed some issues and merged the MR. Seshu confirmed that new branch works with HIP multi-GPU.
I'm seeing a crash on Crusher when trying to run one rank per node instead of one rank per GPU.
Per @elliottslaughter's suggestion I tried both ROCm 4.5 and ROCm 5.2, but both result in a segfault somewhere in hipModuleLaunchKernel when trying to launch a Regent task.
Here is a stack trace: