mariodirenzo opened 2 months ago
Presuming these backtraces are not changing over time, this is guaranteed to be a bug in AMD's driver. It should never be possible for a thread to be stuck in here:
frame #0: 0x0000155553d250f5 libamdhip64.so.6`bool roc::VirtualGPU::dispatchGenericAqlPacket<hsa_kernel_dispatch_packet_s>(hsa_kernel_dispatch_packet_s*, unsigned short, unsigned short, bool, unsigned long) + 645
All calls into ROCm should always return in finite time.
I agree with @lightsighter's assessment that this is likely a ROCm bug, or at least an issue with how ROCm is configured.
@mariodirenzo can you tell us more about your configuration?
I know there are some variables related to resources assigned to each process that by default are not configured in an optimal way for Legion.
What ROCm version is this?
6.0.3
How many GPUs (really GCDs) per node?
This is a node of Tioga (https://hpc.llnl.gov/hardware/compute-platforms/tioga), which has 4 GPUs
How many GPUs (GCDs) per process?
I'm using one process with one GPU
Is this a C++ code? Because Regent doesn't support that ROCm version.
For what it's worth, we've been hitting a lot of ROCm issues with S3D, though our symptoms are different (crashes with an out-of-resources message rather than hangs). The advice we've been given so far has been to try the following:

1. (easy) Set GPU_MAX_HW_QUEUES to different values to see if the problem goes away. My understanding is that the default value of this variable is 4, and the limit is 24 per GCD (note that 2 are reserved for data transfer). So you could probably try values of 8 or 16.
2. Run with AMD_LOG_LEVEL=4 and save the log files. Note these logs will be quite large and probably can only be interpreted by an AMD staff person, so it may not make sense to do this until you get support involved.

Overall, we are probably in territory where it would be appropriate to contact Tioga support and ideally get AMD involved in helping you debug this issue.
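As a sketch, the two environment variables suggested above would be set before launching the hanging test (the launch command itself is a placeholder):

```shell
# GPU_MAX_HW_QUEUES: default is reportedly 4, limit 24 per GCD (2 reserved for data transfer).
export GPU_MAX_HW_QUEUES=8
# AMD_LOG_LEVEL=4 produces very large logs; mostly useful once AMD support is involved.
export AMD_LOG_LEVEL=4
echo "GPU_MAX_HW_QUEUES=$GPU_MAX_HW_QUEUES AMD_LOG_LEVEL=$AMD_LOG_LEVEL"
# ./averageTest.exec ...   (launch the freezing unit test here)
```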
Is this a C++ code?
Yes, this is C++ only
(easy) Set GPU_MAX_HW_QUEUES to different values to see if the problem goes away. My understanding is that the default value of this variable is 4, and the limit is 24 per GCD (note that 2 are reserved for data transfer). So you could probably try values of 8 or 16.
This didn't make any difference.
I've also noticed that the backtrace of thread 4 is changing. Sometimes I get:
thread #4, name = 'averageTest.exe', stop reason = signal SIGSTOP
frame #0: 0x0000155553afd6a3 libamdhip64.so.6`amd::Monitor::unlock() + 35
frame #1: 0x0000155553d4dafc libamdhip64.so.6`roc::KernelBlitManager::copyBufferRect(device::Memory&, device::Memory&, amd::BufferRect const&, amd::BufferRect const&, amd::Coord3D const&, bool, amd::CopyMetadata) const + 1372
frame #2: 0x0000155553d1a08a libamdhip64.so.6`roc::VirtualGPU::copyMemory(unsigned int, amd::Memory&, amd::Memory&, bool, amd::Coord3D const&, amd::Coord3D const&, amd::Coord3D const&, amd::BufferRect const&, amd::BufferRect const&, amd::CopyMetadata) + 650
frame #3: 0x0000155553d1be69 libamdhip64.so.6`roc::VirtualGPU::submitCopyMemory(amd::CopyMemoryCommand&) + 185
frame #4: 0x0000155553cf3fe1 libamdhip64.so.6`amd::Command::enqueue() + 1137
frame #5: 0x0000155553bab75e libamdhip64.so.6`ihipMemcpyParam3D(HIP_MEMCPY3D const*, ihipStream_t*, bool) + 1086
frame #6: 0x0000155553bab93b libamdhip64.so.6`ihipMemcpyParam2D(hip_Memcpy2D const*, ihipStream_t*, bool) + 203
frame #7: 0x0000155553baba1c libamdhip64.so.6`ihipMemcpy2D(void*, unsigned long, void const*, unsigned long, unsigned long, unsigned long, hipMemcpyKind, ihipStream_t*, bool) + 204
frame #8: 0x0000155553bcb806 libamdhip64.so.6`hipMemcpy2DAsync + 662
frame #9: 0x000000000410ad9c averageTest.exec`Realm::Hip::GPUfillXferDes::progress_xd(this=0x0000154c9808ffa0, channel=0x00000000051c8a90, work_until=<unavailable>) at hip_internal.cc:1021:25
frame #10: 0x0000000004111f34 averageTest.exec`Realm::XDQueue<Realm::Hip::GPUfillChannel, Realm::Hip::GPUfillXferDes>::do_work(this=0x00000000051c8ac8, work_until=<unavailable>) at channel.inl:157:35
frame #11: 0x0000000003fe2038 averageTest.exec`Realm::BackgroundWorkManager::Worker::do_work(this=0x0000155437ffc520, max_time_in_ns=<unavailable>, interrupt_flag=0x0000000000000000) at bgwork.cc:599:41
frame #12: 0x0000000003fe2911 averageTest.exec`Realm::BackgroundWorkThread::main_loop(this=0x00000000050dc5b0) at bgwork.cc:103:22
frame #13: 0x00000000040d8441 averageTest.exec`Realm::KernelThread::pthread_entry(data=0x0000000005283e90) at threads.cc:831:29
frame #14: 0x00001555523c21ca libpthread.so.0`start_thread + 234
frame #15: 0x000015554e6fbe73 libc.so.6`__clone + 67
and sometimes I get:
thread #4, name = 'averageTest.exe'
frame #0: 0x000015554e75b40b libc.so.6`sysmalloc + 379
frame #1: 0x000015554e75c840 libc.so.6`_int_malloc + 3392
frame #2: 0x000015554e75d972 libc.so.6`malloc + 498
frame #3: 0x000015554ed35d7c libstdc++.so.6`operator new(unsigned long) + 28
frame #4: 0x0000155553baf790 libamdhip64.so.6`ihipMemset3DCommand(std::vector<amd::Command*, std::allocator<amd::Command*>>&, hipPitchedPtr, int, hipExtent, hip::Stream*, unsigned long) + 368
frame #5: 0x0000155553baf9c4 libamdhip64.so.6`ihipMemset3D(hipPitchedPtr, int, hipExtent, ihipStream_t*, bool) + 212
frame #6: 0x0000155553bafb8b libamdhip64.so.6`hipMemset2DAsync_common(void*, unsigned long, int, unsigned long, unsigned long, ihipStream_t*) + 171
frame #7: 0x0000155553bec222 libamdhip64.so.6`hipMemset2DAsync + 466
frame #8: 0x000000000410a55a averageTest.exec`Realm::Hip::GPUfillXferDes::progress_xd(this=0x0000154c9808ffa0, channel=0x00000000051c8a90, work_until=<unavailable>) at hip_internal.cc:953:19
frame #9: 0x0000000004111f34 averageTest.exec`Realm::XDQueue<Realm::Hip::GPUfillChannel, Realm::Hip::GPUfillXferDes>::do_work(this=0x00000000051c8ac8, work_until=<unavailable>) at channel.inl:157:35
frame #10: 0x0000000003fe2038 averageTest.exec`Realm::BackgroundWorkManager::Worker::do_work(this=0x0000155437ffc520, max_time_in_ns=<unavailable>, interrupt_flag=0x0000000000000000) at bgwork.cc:599:41
frame #11: 0x0000000003fe2911 averageTest.exec`Realm::BackgroundWorkThread::main_loop(this=0x00000000050dc5b0) at bgwork.cc:103:22
frame #12: 0x00000000040d8441 averageTest.exec`Realm::KernelThread::pthread_entry(data=0x0000000005283e90) at threads.cc:831:29
frame #13: 0x00001555523c21ca libpthread.so.0`start_thread + 234
frame #14: 0x000015554e6fbe73 libc.so.6`__clone + 67
If these backtraces are changing within a single run, that would indicate that the code is not deadlocked but is running very slowly.
I don't know if this is still applicable, but at one point fills on HIP were known to be extremely slow: https://github.com/StanfordLegion/legion/issues/1236
I haven't had the opportunity to check any recent HIP versions to see if it got fixed, but that seems like a relatively self-contained test you could do.
If these backtraces are changing within a single run, that would indicate that the code is not deadlocked but is running very slowly.
I'm not sure about that. The test should take approximately 0.6s, and I've run it for more than 30 minutes without making any progress. So, if it is running slowly, it is incredibly slow.
Every time I extract a backtrace, I see thread 4 in this function either at this line https://gitlab.com/StanfordLegion/legion/-/blob/master/runtime/realm/hip/hip_internal.cc#L951 or at https://gitlab.com/StanfordLegion/legion/-/blob/master/runtime/realm/hip/hip_internal.cc#L1019
The test should take approximately 0.6s, and I've run it for more than 30 minutes without making any progress. So, if it is running slowly, it is incredibly slow.
What makes you think it should run in 0.6s? Is that time from an NVIDIA machine?
What makes you think it should run in 0.6s? Is that time from an NVIDIA machine?
I'm running a lot of similar unit tests. Those that run to completion execute in approximately 0.6s, which is also the time it takes to run the tests on NVIDIA machines.
Let me see if I understand. On AMD GPUs, you have some unit tests that finish in 0.6 seconds, but this particular one (which is similar to at least some of the others) does not complete in 30+ minutes. (And all of the unit tests pass in a short amount of time on NVIDIA hardware.)
Assuming this is the case, I guess you could do some delta debugging to figure out what's unique or different about the freezing test. The smaller the test case (and the smaller the difference to another working test case), the more likely it is that we'll be able to spot the root cause.
We can try to run the slow test on an NVIDIA GPU with HIP_TARGET=CUDA to see whether it is an issue with the Realm HIP module or with the AMD driver.
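As a sketch of that experiment (the make variables are assumed from Legion's build conventions, and the build/run command is a placeholder):

```shell
# Build the Realm HIP module against the CUDA backend instead of ROCm,
# so the same code path runs on an NVIDIA GPU (assumed Legion make flags).
export USE_HIP=1
export HIP_TARGET=CUDA
echo "USE_HIP=$USE_HIP HIP_TARGET=$HIP_TARGET"
# make -j && ./averageTest.exec ...   (actual build/run command is a placeholder)
```

If the test completes quickly in this configuration, that points at the AMD driver; if it still hangs, the Realm HIP module becomes the prime suspect.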
Given the description of the symptoms and the backtraces above, I suspect what is happening is that you're hitting one of the un-optimized DMA pathways in the Realm HIP module. The Realm CUDA module has had significant work put into it by people at NVIDIA to optimize DMA transfers and push them into CUDA kernels where possible. A DMA transfer that used to do 1M cudaMemcpy calls and take multiple minutes now is turned into a single CUDA kernel that does 1M loads and stores and takes effectively zero time. Optimizations like that have not been done in the HIP module (and cannot be done by anyone on the Realm team at NVIDIA). The suggestion by @eddy16112 will give us a good indication if that is the case.
Let me see if I understand. On AMD GPUs, you have some unit tests that finish in 0.6 seconds, but this particular one (which is similar to at least some of the others) does not complete in 30+ minutes. (And all of the unit tests pass in a short amount of time on NVIDIA hardware.)
That's right.
We can try to run the slow test on NVIDIA GPU with HIP_TARGET=CUDA to see if it is an issue of realm hip module or AMD driver.
Sure, I'll try it
When running HTR++ unit tests on Tioga, some tests freeze without any error message. The freeze is deterministic and happens only when an AMD GPU is utilized. The backtraces of a hanging execution look like the ones shown above.
Do you have any advice on what might be going wrong?
@elliottslaughter, can you please add this issue to #1032?