StanfordLegion / legion

The Legion Parallel Programming System
https://legion.stanford.edu
Apache License 2.0

Regent: Crash on Crusher with multiple GPUs #1310

Closed: syamajala closed this issue 11 months ago

syamajala commented 2 years ago

I'm seeing a crash on Crusher when trying to run one rank per node instead of one rank per GPU.

Per @elliottslaughter's suggestion I tried both ROCm 4.5 and ROCm 5.2, but both result in a segfault somewhere in hipModuleLaunchKernel when trying to launch a Regent task.

Here is a stack trace:

#0  0x00007fffe3765b41 in ?? () from /opt/rocm-5.2.0/lib/libamdhip64.so.5
#1  0x00007fffe376fbd9 in ?? () from /opt/rocm-5.2.0/lib/libamdhip64.so.5
#2  0x00007fffe3770a4f in hipModuleLaunchKernel () from /opt/rocm-5.2.0/lib/libamdhip64.so.5
#3  0x00007fffea56071c in $<CalcSpeciesTask>.62 ()
   from /gpfs/alpine/scratch/seshuy/cmb103/legion_s3d_nscbc_subrank//build/hept/libphysical_tasks.so
#4  0x00007fffea55de99 in $__regent_task_CalcSpeciesTask_cuda ()
   from /gpfs/alpine/scratch/seshuy/cmb103/legion_s3d_nscbc_subrank//build/hept/libphysical_tasks.so
#5  0x00007fffe5a03b5b in Realm::LocalTaskProcessor::execute_task (this=0x4907570, func_id=62, task_args=...)
    at /gpfs/alpine/cmb103/scratch/seshuy/legion_s3d_nscbc_subrank/legion/runtime/realm/proc_impl.cc:1135
#6  0x00007fffe5a7bbf9 in Realm::Task::execute_on_processor (this=0x7ff9400a5220, p=...)
    at /gpfs/alpine/cmb103/scratch/seshuy/legion_s3d_nscbc_subrank/legion/runtime/realm/tasks.cc:302
#7  0x00007fffe5a7feac in Realm::KernelThreadTaskScheduler::execute_task (this=0x54fa4e0, task=0x7ff9400a5220)
    at /gpfs/alpine/cmb103/scratch/seshuy/legion_s3d_nscbc_subrank/legion/runtime/realm/tasks.cc:1366
#8  0x00007fffe5aceabe in Realm::Hip::GPUTaskScheduler<Realm::KernelThreadTaskScheduler>::execute_task (
    this=0x54fa4e0, task=0x7ff9400a5220)
    at /gpfs/alpine/cmb103/scratch/seshuy/legion_s3d_nscbc_subrank/legion/runtime/realm/hip/hip_module.cc:1527
#9  0x00007fffe5a7ec40 in Realm::ThreadedTaskScheduler::scheduler_loop (this=0x54fa4e0)
    at /gpfs/alpine/cmb103/scratch/seshuy/legion_s3d_nscbc_subrank/legion/runtime/realm/tasks.cc:1105
#10 0x00007fffe5a7f21a in Realm::ThreadedTaskScheduler::scheduler_loop_wlock (this=0x54fa4e0)
    at /gpfs/alpine/cmb103/scratch/seshuy/legion_s3d_nscbc_subrank/legion/runtime/realm/tasks.cc:1217
#11 0x00007fffe5a90cfc in Realm::Thread::thread_entry_wrapper<Realm::ThreadedTaskScheduler, &Realm::ThreadedTaskScheduler::scheduler_loop_wlock> (obj=0x54fa4e0)
    at /gpfs/alpine/cmb103/scratch/seshuy/legion_s3d_nscbc_subrank/legion/runtime/realm/threads.inl:97
#12 0x00007fffe5a93f3a in Realm::KernelThread::pthread_entry (data=0x5f2faf0)
    at /gpfs/alpine/cmb103/scratch/seshuy/legion_s3d_nscbc_subrank/legion/runtime/realm/threads.cc:774
#13 0x00007fffe47176ea in start_thread () from /lib64/libpthread.so.0
#14 0x00007fffebde1a8f in clone () from /lib64/libc.so.6
elliottslaughter commented 2 years ago

The best option for debugging this is a minimized reproducer.

Barring that, we could at least try to get a debug build of ROCm, but that will be slow both to obtain and to interpret results from.

eddy16112 commented 2 years ago

@elliottslaughter is there a debug ROCm? I thought ROCm was delivered as a binary library.

elliottslaughter commented 2 years ago

ROCm is open source. You can build it with Spack, and I did so a while back (when I was waiting for ROCm 5.1 to be deployed to the machine). It's finicky, and hideously slow to build, but doable.

Whether Spack lets you do a debug build, I'm not sure, since I didn't try to do that.

But again, this is definitely plan B, and by a wide margin.

syamajala commented 2 years ago

I pulled out two of the tasks from S3D that I've seen this crash occur on, CalcVolumeTask and CalcSpeciesTask, but they seem to work in a standalone example doing a subrank launch. The only other thing I can think of that might make a difference is that the standalone example isn't using separate compilation.

syamajala commented 2 years ago

I was able to reproduce this using separate compilation with subranks.

There is a standalone example available here on Crusher: /gpfs/alpine/cmb103/world-shared/seshuy/subrank

Run it on an interactive node like this:

srun --tasks-per-node 1 --gpus-per-task 8 --cpus-per-task 6 regent.py optimize_index_launch_nested_nodes_cores_4d.rg -ll:gpu 8 -ll:cpu 1 -ll:csize 16384 -ll:fsize 16384 -findex-launch-dynamic 0 -foverride-demand-index-launch 1 -fflow 0 -fseparate 1 -fincr-comp 1 -logfile run_%.log

elliottslaughter commented 2 years ago

I'm able to reproduce this.

Some things I've noticed so far:

elliottslaughter commented 2 years ago

Here's a shorter reproducer that works for me:

repro.rg:

import "regent"

extern task gpu_task(r : region(int))

task main()
  var r = region(ispace(ptr, 10), int)
  var t = ispace(int1d, 10)
  var p = partition(equal, r, t)
  for i in t do
    gpu_task(p[i])
  end
end

local tmp_dir = './'
local root_dir = arg[0]:match(".*/") or "./"
local loaders = terralib.newlist()
local regent_exe = os.getenv('REGENT') or 'regent'
local tasks_rg = "repro_gpu_task.rg"
local tasks_h = "repro_gpu_task.h"
local tasks_so = tmp_dir .. "librepro_gpu_task.so"
if os.execute(regent_exe .. " " .. tmp_dir .. tasks_rg .. " -fseparate 1 -fgpu hip -fgpu-arch gfx90a -fincr-comp 1") ~= 0 then
  print("Error: failed to compile " .. tmp_dir .. tasks_rg)
  assert(false)
end
local tasks_c = terralib.includec(tasks_h, {"-I", tmp_dir})
loaders:insert(tasks_c["repro_gpu_task_h_register"])
terralib.linklibrary(tasks_so)

terra loader()
  [loaders:map(function(thunk) return `thunk() end)]
end

regentlib.start(main, loader)

repro_gpu_task.rg:

import "regent"

__demand(__cuda)
task gpu_task(r : region(int))
  for i in r do
  end
end

local repro_gpu_task_h = "./repro_gpu_task.h"
local repro_gpu_task_so = "./librepro_gpu_task.so"
regentlib.save_tasks(repro_gpu_task_h, repro_gpu_task_so, nil, nil, nil, nil, false)

Command:

../regent.py repro.rg -ll:gpu 2 -ll:cpu 1 -ll:csize 1024 -ll:fsize 1024 -fflow 0 -fseparate 1 -fincr-comp 1

Output:

$ ../regent.py repro.rg -ll:gpu 2 -ll:cpu 1 -ll:csize 1024 -ll:fsize 1024 -fflow 0 -fseparate 1 -fincr-comp 1
/opt/rocm-4.5.0/llvm/bin/ld.lld -shared -plugin-opt=mcpu=gfx90a -plugin-opt=-amdgpu-internalize-symbols -plugin-opt=O3 -plugin-opt=-amdgpu-early-inline-all=true -plugin-opt=-amdgpu-function-calls=false -o /tmp/lua_NnKvrd /tmp/lua_sHHPld  /opt/rocm-4.5.0/amdgcn/bitcode/oclc_finite_only_off.bc /opt/rocm-4.5.0/amdgcn/bitcode/ocml.bc /opt/rocm-4.5.0/amdgcn/bitcode/oclc_daz_opt_off.bc /opt/rocm-4.5.0/amdgcn/bitcode/oclc_wavefrontsize64_on.bc /opt/rocm-4.5.0/amdgcn/bitcode/oclc_isa_version_90a.bc /opt/rocm-4.5.0/amdgcn/bitcode/oclc_unsafe_math_off.bc /opt/rocm-4.5.0/amdgcn/bitcode/oclc_correctly_rounded_sqrt_on.bc
/opt/rocm-4.5.0/llvm/bin/clang-offload-bundler --inputs=/dev/null,/tmp/lua_NnKvrd --type=o --outputs=/tmp/lua_JJsskf --targets=host-x86_64-unknown-linux-gnu,hipv4-amdgcn-amd-amdhsa--gfx90a
[0 - 7fffc17f6780]    0.116352 {4}{hip}: HIP hijack code not active - device synchronizations required after every GPU task!
/opt/rocm-4.5.0/lib/libamdhip64.so.4(+0x11032d) [0x7fffe8db932d]
/opt/rocm-4.5.0/lib/libamdhip64.so.4(+0x118b8d) [0x7fffe8dc1b8d]
/opt/rocm-4.5.0/lib/libamdhip64.so.4(hipModuleLaunchKernel+0x4fa) [0x7fffe8dc27ca]
./librepro_gpu_task.so(+0x73a3) [0x7fffdebb23a3]
elliottslaughter commented 2 years ago

I think this is just an issue with registering kernels on multiple GPUs.

Minimal reproducer:

import "regent"

__demand(__cuda)
task gpu_task(r : region(int))
  for i in r do
  end
end

task main()
  var r = region(ispace(ptr, 10), int)
  var t = ispace(int1d, 10)
  var p = partition(equal, r, t)
  for i in t do
    gpu_task(p[i])
  end
end

regentlib.start(main)

Command:

../regent.py repro.rg -ll:gpu 2 -ll:cpu 1 -ll:csize 1024 -ll:fsize 1024 -fflow 0 -fgpu hip -fgpu-arch gfx90a

Result:

$ ../regent.py repro.rg -ll:gpu 2 -ll:cpu 1 -ll:csize 1024 -ll:fsize 1024 -fflow 0 -fgpu hip -fgpu-arch gfx90a
/opt/rocm-4.5.0/llvm/bin/ld.lld -shared -plugin-opt=mcpu=gfx90a -plugin-opt=-amdgpu-internalize-symbols -plugin-opt=O3 -plugin-opt=-amdgpu-early-inline-all=true -plugin-opt=-amdgpu-function-calls=false -o /tmp/lua_rctCwo /tmp/lua_7MBS7l  /opt/rocm-4.5.0/amdgcn/bitcode/oclc_finite_only_off.bc /opt/rocm-4.5.0/amdgcn/bitcode/ocml.bc /opt/rocm-4.5.0/amdgcn/bitcode/oclc_daz_opt_off.bc /opt/rocm-4.5.0/amdgcn/bitcode/oclc_wavefrontsize64_on.bc /opt/rocm-4.5.0/amdgcn/bitcode/oclc_isa_version_90a.bc /opt/rocm-4.5.0/amdgcn/bitcode/oclc_unsafe_math_off.bc /opt/rocm-4.5.0/amdgcn/bitcode/oclc_correctly_rounded_sqrt_on.bc
/opt/rocm-4.5.0/llvm/bin/clang-offload-bundler --inputs=/dev/null,/tmp/lua_rctCwo --type=o --outputs=/tmp/lua_JlhK6l --targets=host-x86_64-unknown-linux-gnu,hipv4-amdgcn-amd-amdhsa--gfx90a
[0 - 7fffc19f6780]    0.115502 {4}{hip}: HIP hijack code not active - device synchronizations required after every GPU task!
[0 - 7fffc19f6780]    0.117924 {4}{hip}: HIP hijack is active - device synchronizations not required after every GPU task!
/opt/rocm-4.5.0/lib/libamdhip64.so.4(+0x11032d) [0x7fffe8db932d]
/opt/rocm-4.5.0/lib/libamdhip64.so.4(+0x118b8d) [0x7fffe8dc1b8d]
/opt/rocm-4.5.0/lib/libamdhip64.so.4(hipModuleLaunchKernel+0x4fa) [0x7fffe8dc27ca]
3   terra (JIT)                         0x00007fffed9f73d9 $<gpu_task>.64 + 1529 
[0x7fffc17f1788]
[0x1]

No separate compilation required to reproduce.
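
To illustrate the hypothesis, here is a minimal standalone HIP C++ sketch (not Regent's generated code; the code object name kernel.hsaco and the symbol name gpu_task_kernel are placeholders). With the hipModule* driver-style API, a module and its hipFunction_t are tied to the device that was current when the module was loaded, so every GPU needs its own hipModuleLoad/hipModuleGetFunction before launching; reusing a handle loaded for one device on another is the kind of mismatch that could fault inside hipModuleLaunchKernel.

#include <hip/hip_runtime.h>
#include <cstdio>
#include <vector>

// Minimal sketch of per-device module loading with the hipModule* API.
// "kernel.hsaco" and "gpu_task_kernel" are placeholders, not names that
// Regent actually emits.
#define CHECK(call)                                                            \
  do {                                                                         \
    hipError_t err_ = (call);                                                  \
    if (err_ != hipSuccess) {                                                  \
      std::fprintf(stderr, "%s failed: %s\n", #call, hipGetErrorString(err_)); \
      return 1;                                                                \
    }                                                                          \
  } while (0)

int main() {
  int num_devices = 0;
  CHECK(hipGetDeviceCount(&num_devices));

  std::vector<hipModule_t> modules(num_devices);
  std::vector<hipFunction_t> funcs(num_devices);
  for (int d = 0; d < num_devices; ++d) {
    CHECK(hipSetDevice(d));  // the load below binds to the current device
    CHECK(hipModuleLoad(&modules[d], "kernel.hsaco"));
    CHECK(hipModuleGetFunction(&funcs[d], modules[d], "gpu_task_kernel"));
  }

  for (int d = 0; d < num_devices; ++d) {
    CHECK(hipSetDevice(d));
    // Launching funcs[0] here on device d != 0 (i.e. a handle loaded for a
    // different device) is the kind of registration bug hypothesized above.
    CHECK(hipModuleLaunchKernel(funcs[d], 1, 1, 1, 64, 1, 1,
                                0 /*sharedMemBytes*/, 0 /*stream*/,
                                nullptr /*kernelParams*/, nullptr /*extra*/));
    CHECK(hipDeviceSynchronize());
  }
  return 0;
}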

elliottslaughter commented 1 year ago

Note to self: reproduce with C++ circuit next (with HIP).

elliottslaughter commented 1 year ago

Confirmed that this bug is still present in Regent Circuit with ROCm 5.4.3, 1 node, 2 GPUs. Backtrace:

(gdb) bt
#0  0x00007fffe87f4cc1 in clock_nanosleep@GLIBC_2.2.5 () from /lib64/libc.so.6
#1  0x00007fffe87fa9c3 in nanosleep () from /lib64/libc.so.6
#2  0x00007fffe87fa8da in sleep () from /lib64/libc.so.6
#3  0x00007fffebe031b6 in Realm::realm_freeze (signal=11)
    at /autofs/nccs-svm1_home1/eslaught/crusher/legion/runtime/realm/runtime_impl.cc:200
#4  <signal handler called>
#5  0x00007fffe8cd75c2 in ?? () from /opt/rocm-5.4.3/lib/libamdhip64.so.5
#6  0x00007fffe8cd7c62 in ?? () from /opt/rocm-5.4.3/lib/libamdhip64.so.5
#7  0x00007fffe8ce763c in hipModuleLaunchKernel () from /opt/rocm-5.4.3/lib/libamdhip64.so.5
#8  0x0000000000415395 in $<calculate_new_currents> ()
#9  0x00000000004140d9 in $__regent_task_calculate_new_currents_cuda ()
#10 0x00007fffec227ef0 in Realm::Hip::GPUProcessor::execute_task (this=0x9ad040, func_id=19, task_args=...)
    at /autofs/nccs-svm1_home1/eslaught/crusher/legion/runtime/realm/hip/hip_module.cc:2087
#11 0x00007fffebf7317a in Realm::Task::execute_on_processor (this=0x7ff16805e630, p=...)
    at /autofs/nccs-svm1_home1/eslaught/crusher/legion/runtime/realm/tasks.cc:326
#12 0x00007fffebf77146 in Realm::KernelThreadTaskScheduler::execute_task (this=0xc8bf90, task=0x7ff16805e630)
    at /autofs/nccs-svm1_home1/eslaught/crusher/legion/runtime/realm/tasks.cc:1421
#13 0x00007fffec254a68 in Realm::Hip::GPUTaskScheduler<Realm::KernelThreadTaskScheduler>::execute_task (this=0xc8bf90, task=0x7ff16805e630)
    at /autofs/nccs-svm1_home1/eslaught/crusher/legion/runtime/realm/hip/hip_module.cc:1500
#14 0x00007fffebf75f85 in Realm::ThreadedTaskScheduler::scheduler_loop (this=0xc8bf90)
    at /autofs/nccs-svm1_home1/eslaught/crusher/legion/runtime/realm/tasks.cc:1160
#15 0x00007fffebf765c2 in Realm::ThreadedTaskScheduler::scheduler_loop_wlock (this=0xc8bf90)
    at /autofs/nccs-svm1_home1/eslaught/crusher/legion/runtime/realm/tasks.cc:1272
#16 0x00007fffebf7dca4 in Realm::Thread::thread_entry_wrapper<Realm::ThreadedTaskScheduler, &Realm::ThreadedTaskScheduler::scheduler_loop_wlock> (obj=0xc8bf90) at /autofs/nccs-svm1_home1/eslaught/crusher/legion/runtime/realm/threads.inl:97
#17 0x00007fffebf4d3d7 in Realm::KernelThread::pthread_entry (data=0xb01e70)
    at /autofs/nccs-svm1_home1/eslaught/crusher/legion/runtime/realm/threads.cc:781
#18 0x00007fffe7cb66ea in start_thread () from /lib64/libpthread.so.0
#19 0x00007fffe8830a6f in clone () from /lib64/libc.so.6
elliottslaughter commented 1 year ago

Confirmed that multi-GPU works with the C++ Circuit under ROCm 5.4.3. So this is probably a bug either in the hipModule* APIs (since hipcc does not use those) or in how Regent is using hipModule*.
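
For contrast, a minimal sketch of the hipcc path (assumed, not taken from the Circuit example): kernels compiled with hipcc and launched through the runtime API have their code objects registered for every device by hipcc's embedded startup machinery, so the same launch works on each GPU after a plain hipSetDevice and never goes through hipModuleLaunchKernel.

#include <hip/hip_runtime.h>
#include <cstdio>

// Empty kernel compiled by hipcc; its code object is embedded in the binary
// and registered for all devices automatically.
__global__ void empty_kernel() {}

int main() {
  int num_devices = 0;
  if (hipGetDeviceCount(&num_devices) != hipSuccess) return 1;

  for (int d = 0; d < num_devices; ++d) {
    if (hipSetDevice(d) != hipSuccess) return 1;
    // No explicit hipModuleLoad/hipModuleGetFunction here; the runtime API
    // resolves the kernel for the current device on its own.
    hipLaunchKernelGGL(empty_kernel, dim3(1), dim3(64), 0, 0);
    if (hipDeviceSynchronize() != hipSuccess) {
      std::fprintf(stderr, "launch failed on device %d\n", d);
      return 1;
    }
  }
  std::printf("launched on %d device(s)\n", num_devices);
  return 0;
}

If this pattern works while the hipModule*-based path does not, that would point at the module loading/registration path rather than the hardware or driver.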

elliottslaughter commented 1 year ago

Based on feedback from OLCF staff this week, I am testing the following patch for this:

https://gitlab.com/StanfordLegion/legion/-/merge_requests/903

Currently waiting on jobs to finish on Crusher to see if it's working or not.

elliottslaughter commented 11 months ago

I fixed some issues and merged the MR. Seshu confirmed that the new branch works with HIP multi-GPU.