StanfordLegion / legion

The Legion Parallel Programming System
https://legion.stanford.edu
Apache License 2.0

Regent: Crash on Crusher with multiple GPUs #1310

Closed: syamajala closed this issue 11 months ago

syamajala commented 2 years ago

I'm seeing a crash on Crusher when trying to run one rank per node instead of one rank per GPU.

Per @elliottslaughter's suggestion I tried both ROCm 4.5 and ROCm 5.2, but both result in a segfault somewhere in hipModuleLaunchKernel when trying to launch a Regent task.

Here is a stack trace:

#0  0x00007fffe3765b41 in ?? () from /opt/rocm-5.2.0/lib/libamdhip64.so.5
#1  0x00007fffe376fbd9 in ?? () from /opt/rocm-5.2.0/lib/libamdhip64.so.5
#2  0x00007fffe3770a4f in hipModuleLaunchKernel () from /opt/rocm-5.2.0/lib/libamdhip64.so.5
#3  0x00007fffea56071c in $<CalcSpeciesTask>.62 ()
   from /gpfs/alpine/scratch/seshuy/cmb103/legion_s3d_nscbc_subrank//build/hept/libphysical_tasks.so
#4  0x00007fffea55de99 in $__regent_task_CalcSpeciesTask_cuda ()
   from /gpfs/alpine/scratch/seshuy/cmb103/legion_s3d_nscbc_subrank//build/hept/libphysical_tasks.so
#5  0x00007fffe5a03b5b in Realm::LocalTaskProcessor::execute_task (this=0x4907570, func_id=62, task_args=...)
    at /gpfs/alpine/cmb103/scratch/seshuy/legion_s3d_nscbc_subrank/legion/runtime/realm/proc_impl.cc:1135
#6  0x00007fffe5a7bbf9 in Realm::Task::execute_on_processor (this=0x7ff9400a5220, p=...)
    at /gpfs/alpine/cmb103/scratch/seshuy/legion_s3d_nscbc_subrank/legion/runtime/realm/tasks.cc:302
#7  0x00007fffe5a7feac in Realm::KernelThreadTaskScheduler::execute_task (this=0x54fa4e0, task=0x7ff9400a5220)
    at /gpfs/alpine/cmb103/scratch/seshuy/legion_s3d_nscbc_subrank/legion/runtime/realm/tasks.cc:1366
#8  0x00007fffe5aceabe in Realm::Hip::GPUTaskScheduler<Realm::KernelThreadTaskScheduler>::execute_task (
    this=0x54fa4e0, task=0x7ff9400a5220)
    at /gpfs/alpine/cmb103/scratch/seshuy/legion_s3d_nscbc_subrank/legion/runtime/realm/hip/hip_module.cc:1527
#9  0x00007fffe5a7ec40 in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x54fa4e0)
    at /gpfs/alpine/cmb103/scratch/seshuy/legion_s3d_nscbc_subrank/legion/runtime/realm/tasks.cc:1105
#10 0x00007fffe5a7f21a in Realm::ThreadedTaskScheduler::scheduler_loop_wlock (this=0x54fa4e0)
    at /gpfs/alpine/cmb103/scratch/seshuy/legion_s3d_nscbc_subrank/legion/runtime/realm/tasks.cc:1217
#11 0x00007fffe5a90cfc in Realm::Thread::thread_entry_wrapper<Realm::ThreadedTaskScheduler, &Realm::ThreadedTaskScheduler::scheduler_loop_wlock> (obj=0x54fa4e0)
    at /gpfs/alpine/cmb103/scratch/seshuy/legion_s3d_nscbc_subrank/legion/runtime/realm/threads.inl:97
#12 0x00007fffe5a93f3a in Realm::KernelThread::pthread_entry (data=0x5f2faf0)
    at /gpfs/alpine/cmb103/scratch/seshuy/legion_s3d_nscbc_subrank/legion/runtime/realm/threads.cc:774
#13 0x00007fffe47176ea in start_thread () from /lib64/libpthread.so.0
#14 0x00007fffebde1a8f in clone () from /lib64/libc.so.6
elliottslaughter commented 2 years ago

The best option for debugging this is a minimized reproducer.

Barring that, we could at least try to get a debug build of ROCm, but that will be slow both to obtain and to interpret results from.

eddy16112 commented 2 years ago

@elliottslaughter is there a debug ROCm? I thought ROCm was delivered as a binary library.

elliottslaughter commented 2 years ago

ROCm is open source. You can build it with Spack, and I did so a while back (when I was waiting for ROCm 5.1 to be deployed to the machine). It's finicky, and hideously slow to build, but doable.

Whether Spack lets you do a debug build, I'm not sure, since I didn't try to do that.

But again, this is definitely plan B, and by a wide margin.

syamajala commented 2 years ago

I pulled out two of the tasks from S3D that I've seen this crash occur on, CalcVolumeTask and CalcSpeciesTask, but they seem to work in a standalone example doing a subrank launch. The only other thing I can think of that might make a difference is that the standalone example isn't using separate compilation.

syamajala commented 2 years ago

I was able to reproduce this using separate compilation with subranks.

There is a standalone example available here on Crusher: /gpfs/alpine/cmb103/world-shared/seshuy/subrank

Run it on an interactive node like this:

srun --tasks-per-node 1 --gpus-per-task 8 --cpus-per-task 6 regent.py optimize_index_launch_nested_nodes_cores_4d.rg -ll:gpu 8 -ll:cpu 1 -ll:csize 16384 -ll:fsize 16384 -findex-launch-dynamic 0 -foverride-demand-index-launch 1 -fflow 0 -fseparate 1 -fincr-comp 1 -logfile run_%.log

elliottslaughter commented 2 years ago

I'm able to reproduce this.

Some things I've noticed so far:

elliottslaughter commented 2 years ago

Here's a shorter reproducer that works for me:

repro.rg:

import "regent"

extern task gpu_task(r : region(int))

task main()
  var r = region(ispace(ptr, 10), int)
  var t = ispace(int1d, 10)
  var p = partition(equal, r, t)
  for i in t do
    gpu_task(p[i])
  end
end

local tmp_dir = './'
local root_dir = arg[0]:match(".*/") or "./"
local loaders = terralib.newlist()
local regent_exe = os.getenv('REGENT') or 'regent'
local tasks_rg = "repro_gpu_task.rg"
local tasks_h = "repro_gpu_task.h"
local tasks_so = tmp_dir .. "librepro_gpu_task.so"
if os.execute(regent_exe .. " " .. tmp_dir .. tasks_rg .. " -fseparate 1 -fgpu hip -fgpu-arch gfx90a -fincr-comp 1") ~= 0 then
  print("Error: failed to compile " .. tmp_dir .. tasks_rg)
  assert(false)
end
local tasks_c = terralib.includec(tasks_h, {"-I", tmp_dir})
loaders:insert(tasks_c["repro_gpu_task_h_register"])
terralib.linklibrary(tasks_so)

terra loader()
  [loaders:map(function(thunk) return `thunk() end)]
end

regentlib.start(main, loader)

repro_gpu_task.rg:

import "regent"

__demand(__cuda)
task gpu_task(r : region(int))
  for i in r do
  end
end

local repro_gpu_task_h = "./repro_gpu_task.h"
local repro_gpu_task_so = "./librepro_gpu_task.so"
regentlib.save_tasks(repro_gpu_task_h, repro_gpu_task_so, nil, nil, nil, nil, false)

Command:

../regent.py repro.rg -ll:gpu 2 -ll:cpu 1 -ll:csize 1024 -ll:fsize 1024 -fflow 0 -fseparate 1 -fincr-comp 1

Output:

$ ../regent.py repro.rg -ll:gpu 2 -ll:cpu 1 -ll:csize 1024 -ll:fsize 1024 -fflow 0 -fseparate 1 -fincr-comp 1
/opt/rocm-4.5.0/llvm/bin/ld.lld -shared -plugin-opt=mcpu=gfx90a -plugin-opt=-amdgpu-internalize-symbols -plugin-opt=O3 -plugin-opt=-amdgpu-early-inline-all=true -plugin-opt=-amdgpu-function-calls=false -o /tmp/lua_NnKvrd /tmp/lua_sHHPld  /opt/rocm-4.5.0/amdgcn/bitcode/oclc_finite_only_off.bc /opt/rocm-4.5.0/amdgcn/bitcode/ocml.bc /opt/rocm-4.5.0/amdgcn/bitcode/oclc_daz_opt_off.bc /opt/rocm-4.5.0/amdgcn/bitcode/oclc_wavefrontsize64_on.bc /opt/rocm-4.5.0/amdgcn/bitcode/oclc_isa_version_90a.bc /opt/rocm-4.5.0/amdgcn/bitcode/oclc_unsafe_math_off.bc /opt/rocm-4.5.0/amdgcn/bitcode/oclc_correctly_rounded_sqrt_on.bc
/opt/rocm-4.5.0/llvm/bin/clang-offload-bundler --inputs=/dev/null,/tmp/lua_NnKvrd --type=o --outputs=/tmp/lua_JJsskf --targets=host-x86_64-unknown-linux-gnu,hipv4-amdgcn-amd-amdhsa--gfx90a
[0 - 7fffc17f6780]    0.116352 {4}{hip}: HIP hijack code not active - device synchronizations required after every GPU task!
/opt/rocm-4.5.0/lib/libamdhip64.so.4(+0x11032d) [0x7fffe8db932d]
/opt/rocm-4.5.0/lib/libamdhip64.so.4(+0x118b8d) [0x7fffe8dc1b8d]
/opt/rocm-4.5.0/lib/libamdhip64.so.4(hipModuleLaunchKernel+0x4fa) [0x7fffe8dc27ca]
./librepro_gpu_task.so(+0x73a3) [0x7fffdebb23a3]
elliottslaughter commented 2 years ago

I think this is just an issue with registering kernels on multiple GPUs.

Minimal reproducer:

import "regent"

__demand(__cuda)
task gpu_task(r : region(int))
  for i in r do
  end
end

task main()
  var r = region(ispace(ptr, 10), int)
  var t = ispace(int1d, 10)
  var p = partition(equal, r, t)
  for i in t do
    gpu_task(p[i])
  end
end

regentlib.start(main)

Command:

../regent.py repro.rg -ll:gpu 2 -ll:cpu 1 -ll:csize 1024 -ll:fsize 1024 -fflow 0 -fgpu hip -fgpu-arch gfx90a

Result:

$ ../regent.py repro.rg -ll:gpu 2 -ll:cpu 1 -ll:csize 1024 -ll:fsize 1024 -fflow 0 -fgpu hip -fgpu-arch gfx90a
/opt/rocm-4.5.0/llvm/bin/ld.lld -shared -plugin-opt=mcpu=gfx90a -plugin-opt=-amdgpu-internalize-symbols -plugin-opt=O3 -plugin-opt=-amdgpu-early-inline-all=true -plugin-opt=-amdgpu-function-calls=false -o /tmp/lua_rctCwo /tmp/lua_7MBS7l  /opt/rocm-4.5.0/amdgcn/bitcode/oclc_finite_only_off.bc /opt/rocm-4.5.0/amdgcn/bitcode/ocml.bc /opt/rocm-4.5.0/amdgcn/bitcode/oclc_daz_opt_off.bc /opt/rocm-4.5.0/amdgcn/bitcode/oclc_wavefrontsize64_on.bc /opt/rocm-4.5.0/amdgcn/bitcode/oclc_isa_version_90a.bc /opt/rocm-4.5.0/amdgcn/bitcode/oclc_unsafe_math_off.bc /opt/rocm-4.5.0/amdgcn/bitcode/oclc_correctly_rounded_sqrt_on.bc
/opt/rocm-4.5.0/llvm/bin/clang-offload-bundler --inputs=/dev/null,/tmp/lua_rctCwo --type=o --outputs=/tmp/lua_JlhK6l --targets=host-x86_64-unknown-linux-gnu,hipv4-amdgcn-amd-amdhsa--gfx90a
[0 - 7fffc19f6780]    0.115502 {4}{hip}: HIP hijack code not active - device synchronizations required after every GPU task!
[0 - 7fffc19f6780]    0.117924 {4}{hip}: HIP hijack is active - device synchronizations not required after every GPU task!
/opt/rocm-4.5.0/lib/libamdhip64.so.4(+0x11032d) [0x7fffe8db932d]
/opt/rocm-4.5.0/lib/libamdhip64.so.4(+0x118b8d) [0x7fffe8dc1b8d]
/opt/rocm-4.5.0/lib/libamdhip64.so.4(hipModuleLaunchKernel+0x4fa) [0x7fffe8dc27ca]
3   terra (JIT)                         0x00007fffed9f73d9 $<gpu_task>.64 + 1529 
[0x7fffc17f1788]
[0x1]

No separate compilation required to reproduce.
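
To illustrate the hypothesis, here is a minimal standalone HIP C++ sketch (not Regent's generated code; the code object name kernel.hsaco and the symbol name gpu_task_kernel are placeholders). With the hipModule* driver-style API, a module and its hipFunction_t are tied to the device that was current when the module was loaded, so every GPU needs its own hipModuleLoad/hipModuleGetFunction before launching; reusing a handle loaded for one device on another is the kind of mismatch that could fault inside hipModuleLaunchKernel.

#include <hip/hip_runtime.h>
#include <cstdio>
#include <vector>

// Minimal sketch of per-device module loading with the hipModule* API.
// "kernel.hsaco" and "gpu_task_kernel" are placeholders, not names that
// Regent actually emits.
#define CHECK(call)                                                            \
  do {                                                                         \
    hipError_t err_ = (call);                                                  \
    if (err_ != hipSuccess) {                                                  \
      std::fprintf(stderr, "%s failed: %s\n", #call, hipGetErrorString(err_)); \
      return 1;                                                                \
    }                                                                          \
  } while (0)

int main() {
  int num_devices = 0;
  CHECK(hipGetDeviceCount(&num_devices));

  std::vector<hipModule_t> modules(num_devices);
  std::vector<hipFunction_t> funcs(num_devices);
  for (int d = 0; d < num_devices; ++d) {
    CHECK(hipSetDevice(d));  // the load below binds to the current device
    CHECK(hipModuleLoad(&modules[d], "kernel.hsaco"));
    CHECK(hipModuleGetFunction(&funcs[d], modules[d], "gpu_task_kernel"));
  }

  for (int d = 0; d < num_devices; ++d) {
    CHECK(hipSetDevice(d));
    // Launching funcs[0] here on device d != 0 (i.e. a handle loaded for a
    // different device) is the kind of registration bug hypothesized above.
    CHECK(hipModuleLaunchKernel(funcs[d], 1, 1, 1, 64, 1, 1,
                                0 /*sharedMemBytes*/, 0 /*stream*/,
                                nullptr /*kernelParams*/, nullptr /*extra*/));
    CHECK(hipDeviceSynchronize());
  }
  return 0;
}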

elliottslaughter commented 1 year ago

Note to self: reproduce with C++ circuit next (with HIP).

elliottslaughter commented 1 year ago

Confirmed that this bug is still present in Regent Circuit with ROCm 5.4.3, 1 node, 2 GPUs. Backtrace:

(gdb) bt
#0  0x00007fffe87f4cc1 in clock_nanosleep@GLIBC_2.2.5 () from /lib64/libc.so.6
#1  0x00007fffe87fa9c3 in nanosleep () from /lib64/libc.so.6
#2  0x00007fffe87fa8da in sleep () from /lib64/libc.so.6
#3  0x00007fffebe031b6 in Realm::realm_freeze (signal=11)
    at /autofs/nccs-svm1_home1/eslaught/crusher/legion/runtime/realm/runtime_impl.cc:200
#4  <signal handler called>
#5  0x00007fffe8cd75c2 in ?? () from /opt/rocm-5.4.3/lib/libamdhip64.so.5
#6  0x00007fffe8cd7c62 in ?? () from /opt/rocm-5.4.3/lib/libamdhip64.so.5
#7  0x00007fffe8ce763c in hipModuleLaunchKernel () from /opt/rocm-5.4.3/lib/libamdhip64.so.5
#8  0x0000000000415395 in $<calculate_new_currents> ()
#9  0x00000000004140d9 in $__regent_task_calculate_new_currents_cuda ()
#10 0x00007fffec227ef0 in Realm::Hip::GPUProcessor::execute_task (this=0x9ad040, func_id=19, task_args=...)
    at /autofs/nccs-svm1_home1/eslaught/crusher/legion/runtime/realm/hip/hip_module.cc:2087
#11 0x00007fffebf7317a in Realm::Task::execute_on_processor (this=0x7ff16805e630, p=...)
    at /autofs/nccs-svm1_home1/eslaught/crusher/legion/runtime/realm/tasks.cc:326
#12 0x00007fffebf77146 in Realm::KernelThreadTaskScheduler::execute_task (this=0xc8bf90, task=0x7ff16805e630)
    at /autofs/nccs-svm1_home1/eslaught/crusher/legion/runtime/realm/tasks.cc:1421
#13 0x00007fffec254a68 in Realm::Hip::GPUTaskScheduler<Realm::KernelThreadTaskScheduler>::execute_task (this=0xc8bf90, task=0x7ff16805e630)
    at /autofs/nccs-svm1_home1/eslaught/crusher/legion/runtime/realm/hip/hip_module.cc:1500
#14 0x00007fffebf75f85 in Realm::ThreadedTaskScheduler::scheduler_loop (this=0xc8bf90)
    at /autofs/nccs-svm1_home1/eslaught/crusher/legion/runtime/realm/tasks.cc:1160
#15 0x00007fffebf765c2 in Realm::ThreadedTaskScheduler::scheduler_loop_wlock (this=0xc8bf90)
    at /autofs/nccs-svm1_home1/eslaught/crusher/legion/runtime/realm/tasks.cc:1272
#16 0x00007fffebf7dca4 in Realm::Thread::thread_entry_wrapper<Realm::ThreadedTaskScheduler, &Realm::ThreadedTaskScheduler::scheduler_loop_wlock> (obj=0xc8bf90) at /autofs/nccs-svm1_home1/eslaught/crusher/legion/runtime/realm/threads.inl:97
#17 0x00007fffebf4d3d7 in Realm::KernelThread::pthread_entry (data=0xb01e70)
    at /autofs/nccs-svm1_home1/eslaught/crusher/legion/runtime/realm/threads.cc:781
#18 0x00007fffe7cb66ea in start_thread () from /lib64/libpthread.so.0
#19 0x00007fffe8830a6f in clone () from /lib64/libc.so.6
elliottslaughter commented 1 year ago

Confirmed that multi-GPU works with the C++ Circuit under ROCm 5.4.3. So this is probably a bug either in the hipModule* APIs (since hipcc does not use those) or in how Regent is using hipModule*.
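
For contrast, a minimal sketch of the hipcc path (assumed, not taken from the Circuit example): kernels compiled with hipcc and launched through the runtime API have their code objects registered for every device by hipcc's embedded startup machinery, so the same launch works on each GPU after a plain hipSetDevice and never goes through hipModuleLaunchKernel.

#include <hip/hip_runtime.h>
#include <cstdio>

// Empty kernel compiled by hipcc; its code object is embedded in the binary
// and registered for all devices automatically.
__global__ void empty_kernel() {}

int main() {
  int num_devices = 0;
  if (hipGetDeviceCount(&num_devices) != hipSuccess) return 1;

  for (int d = 0; d < num_devices; ++d) {
    if (hipSetDevice(d) != hipSuccess) return 1;
    // No explicit hipModuleLoad/hipModuleGetFunction here; the runtime API
    // resolves the kernel for the current device on its own.
    hipLaunchKernelGGL(empty_kernel, dim3(1), dim3(64), 0, 0);
    if (hipDeviceSynchronize() != hipSuccess) {
      std::fprintf(stderr, "launch failed on device %d\n", d);
      return 1;
    }
  }
  std::printf("launched on %d device(s)\n", num_devices);
  return 0;
}

If this pattern works while the hipModule*-based path does not, that would point at the module loading/registration path rather than the hardware or driver.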

elliottslaughter commented 1 year ago

Based on feedback from OLCF staff this week, I am testing the following patch for this:

https://gitlab.com/StanfordLegion/legion/-/merge_requests/903

Currently waiting on jobs to finish on Crusher to see if it's working or not.

elliottslaughter commented 11 months ago

I fixed some issues and merged the MR. Seshu confirmed that the new branch works with HIP multi-GPU.