mariodirenzo opened 2 months ago
Presuming these backtraces are not changing over time, this is guaranteed to be a bug in AMD's driver. It should never be possible for a thread to be stuck in here:
frame #0: 0x0000155553d250f5 libamdhip64.so.6`bool roc::VirtualGPU::dispatchGenericAqlPacket<hsa_kernel_dispatch_packet_s>(hsa_kernel_dispatch_packet_s*, unsigned short, unsigned short, bool, unsigned long) + 645
All calls into ROCm should always return in finite time.
I agree with @lightsighter's assessment that this is likely a ROCm bug, or at least an issue with how ROCm is configured.
@mariodirenzo can you tell us more about your configuration?
I know there are some variables related to resources assigned to each process that by default are not configured in an optimal way for Legion.
What ROCm version is this?
6.0.3
How many GPUs (really GCDs) per node?
This is a node of Tioga (https://hpc.llnl.gov/hardware/compute-platforms/tioga), which has 4 GPUs
How many GPUs (GCDs) per process?
I'm using one process with one GPU
Is this a C++ code? Because Regent doesn't support that ROCm version.
For what it's worth, we've been hitting a lot of ROCm issues with S3D, though our symptoms are different (crashes with an out-of-resources message rather than hangs). The advice we've been given so far has been to try the following:

1. (easy) Set GPU_MAX_HW_QUEUES to different values to see if the problem goes away. My understanding is that the default value of this variable is 4, and the limit is 24 per GCD (note that 2 are reserved for data transfer). So you could probably try values of 8 or 16.
2. Run with AMD_LOG_LEVEL=4 and save the log files. Note these logs will be quite large and probably can only be interpreted by an AMD staff person, so it may not make sense to do this until you get support involved.

Overall, we are probably in territory where it would be appropriate to contact Tioga support and ideally get AMD involved in helping you debug this issue.
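As a sketch, the two environment variables suggested above would be set before launching the hanging test (the launch command itself is a placeholder):

```shell
# GPU_MAX_HW_QUEUES: default is reportedly 4, limit 24 per GCD (2 reserved for data transfer).
export GPU_MAX_HW_QUEUES=8
# AMD_LOG_LEVEL=4 produces very large logs; mostly useful once AMD support is involved.
export AMD_LOG_LEVEL=4
echo "GPU_MAX_HW_QUEUES=$GPU_MAX_HW_QUEUES AMD_LOG_LEVEL=$AMD_LOG_LEVEL"
# ./averageTest.exec ...   (launch the freezing unit test here)
```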
Is this a C++ code?
Yes, this is C++ only
(easy) Set GPU_MAX_HW_QUEUES to different values to see if the problem goes away. My understanding is that the default value of this variable is 4, and the limit is 24 per GCD (note that 2 are reserved for data transfer). So you could probably try values of 8 or 16.
This didn't make any difference.
I've also noticed that the backtrace of thread 4 is changing. Sometimes I get:
thread #4, name = 'averageTest.exe', stop reason = signal SIGSTOP
frame #0: 0x0000155553afd6a3 libamdhip64.so.6`amd::Monitor::unlock() + 35
frame #1: 0x0000155553d4dafc libamdhip64.so.6`roc::KernelBlitManager::copyBufferRect(device::Memory&, device::Memory&, amd::BufferRect const&, amd::BufferRect const&, amd::Coord3D const&, bool, amd::CopyMetadata) const + 1372
frame #2: 0x0000155553d1a08a libamdhip64.so.6`roc::VirtualGPU::copyMemory(unsigned int, amd::Memory&, amd::Memory&, bool, amd::Coord3D const&, amd::Coord3D const&, amd::Coord3D const&, amd::BufferRect const&, amd::BufferRect const&, amd::CopyMetadata) + 650
frame #3: 0x0000155553d1be69 libamdhip64.so.6`roc::VirtualGPU::submitCopyMemory(amd::CopyMemoryCommand&) + 185
frame #4: 0x0000155553cf3fe1 libamdhip64.so.6`amd::Command::enqueue() + 1137
frame #5: 0x0000155553bab75e libamdhip64.so.6`ihipMemcpyParam3D(HIP_MEMCPY3D const*, ihipStream_t*, bool) + 1086
frame #6: 0x0000155553bab93b libamdhip64.so.6`ihipMemcpyParam2D(hip_Memcpy2D const*, ihipStream_t*, bool) + 203
frame #7: 0x0000155553baba1c libamdhip64.so.6`ihipMemcpy2D(void*, unsigned long, void const*, unsigned long, unsigned long, unsigned long, hipMemcpyKind, ihipStream_t*, bool) + 204
frame #8: 0x0000155553bcb806 libamdhip64.so.6`hipMemcpy2DAsync + 662
frame #9: 0x000000000410ad9c averageTest.exec`Realm::Hip::GPUfillXferDes::progress_xd(this=0x0000154c9808ffa0, channel=0x00000000051c8a90, work_until=<unavailable>) at hip_internal.cc:1021:25
frame #10: 0x0000000004111f34 averageTest.exec`Realm::XDQueue<Realm::Hip::GPUfillChannel, Realm::Hip::GPUfillXferDes>::do_work(this=0x00000000051c8ac8, work_until=<unavailable>) at channel.inl:157:35
frame #11: 0x0000000003fe2038 averageTest.exec`Realm::BackgroundWorkManager::Worker::do_work(this=0x0000155437ffc520, max_time_in_ns=<unavailable>, interrupt_flag=0x0000000000000000) at bgwork.cc:599:41
frame #12: 0x0000000003fe2911 averageTest.exec`Realm::BackgroundWorkThread::main_loop(this=0x00000000050dc5b0) at bgwork.cc:103:22
frame #13: 0x00000000040d8441 averageTest.exec`Realm::KernelThread::pthread_entry(data=0x0000000005283e90) at threads.cc:831:29
frame #14: 0x00001555523c21ca libpthread.so.0`start_thread + 234
frame #15: 0x000015554e6fbe73 libc.so.6`__clone + 67
and sometimes I get:
thread #4, name = 'averageTest.exe'
frame #0: 0x000015554e75b40b libc.so.6`sysmalloc + 379
frame #1: 0x000015554e75c840 libc.so.6`_int_malloc + 3392
frame #2: 0x000015554e75d972 libc.so.6`malloc + 498
frame #3: 0x000015554ed35d7c libstdc++.so.6`operator new(unsigned long) + 28
frame #4: 0x0000155553baf790 libamdhip64.so.6`ihipMemset3DCommand(std::vector<amd::Command*, std::allocator<amd::Command*>>&, hipPitchedPtr, int, hipExtent, hip::Stream*, unsigned long) + 368
frame #5: 0x0000155553baf9c4 libamdhip64.so.6`ihipMemset3D(hipPitchedPtr, int, hipExtent, ihipStream_t*, bool) + 212
frame #6: 0x0000155553bafb8b libamdhip64.so.6`hipMemset2DAsync_common(void*, unsigned long, int, unsigned long, unsigned long, ihipStream_t*) + 171
frame #7: 0x0000155553bec222 libamdhip64.so.6`hipMemset2DAsync + 466
frame #8: 0x000000000410a55a averageTest.exec`Realm::Hip::GPUfillXferDes::progress_xd(this=0x0000154c9808ffa0, channel=0x00000000051c8a90, work_until=<unavailable>) at hip_internal.cc:953:19
frame #9: 0x0000000004111f34 averageTest.exec`Realm::XDQueue<Realm::Hip::GPUfillChannel, Realm::Hip::GPUfillXferDes>::do_work(this=0x00000000051c8ac8, work_until=<unavailable>) at channel.inl:157:35
frame #10: 0x0000000003fe2038 averageTest.exec`Realm::BackgroundWorkManager::Worker::do_work(this=0x0000155437ffc520, max_time_in_ns=<unavailable>, interrupt_flag=0x0000000000000000) at bgwork.cc:599:41
frame #11: 0x0000000003fe2911 averageTest.exec`Realm::BackgroundWorkThread::main_loop(this=0x00000000050dc5b0) at bgwork.cc:103:22
frame #12: 0x00000000040d8441 averageTest.exec`Realm::KernelThread::pthread_entry(data=0x0000000005283e90) at threads.cc:831:29
frame #13: 0x00001555523c21ca libpthread.so.0`start_thread + 234
frame #14: 0x000015554e6fbe73 libc.so.6`__clone + 67
If these backtraces are changing within a single run, that would indicate that the code is not deadlocked but is running very slowly.
I don't know if this is still applicable, but at one point fills on HIP were known to be extremely slow: https://github.com/StanfordLegion/legion/issues/1236
I haven't had the opportunity to check any recent HIP versions to see if it got fixed, but that seems like a relatively self-contained test you could do.
If these backtraces are changing within a single run, that would indicate that the code is not deadlocked but is running very slowly.
I'm not sure about that. The test should take approximately 0.6s, and I've run it for more than 30 minutes without making any progress. So, if it is running slowly, it is incredibly slow.
Every time I extract a backtrace, I see thread 4 in this function either at this line https://gitlab.com/StanfordLegion/legion/-/blob/master/runtime/realm/hip/hip_internal.cc#L951 or at https://gitlab.com/StanfordLegion/legion/-/blob/master/runtime/realm/hip/hip_internal.cc#L1019
The test should take approximately 0.6s, and I've run it for more than 30 minutes without making any progress. So, if it is running slowly, it is incredibly slow.
What makes you think it should run in 0.6s? Is that time from an NVIDIA machine?
What makes you think it should run in 0.6s? Is that time from an NVIDIA machine?
I'm running a lot of similar unit tests. Those that run to completion execute in approximately 0.6s, which is also the time it takes to run the tests on NVIDIA machines.
Let me see if I understand. On AMD GPUs, you have some unit tests that finish in 0.6 seconds, but this particular one (which is similar to at least some of the others) does not complete in 30+ minutes. (And all of the unit tests pass in a short amount of time on NVIDIA hardware.)
Assuming this is the case, I guess you could do some delta debugging to figure out what's unique or different about the freezing test. The smaller the test case (and the smaller the difference to another working test case), the more likely it is that we'll be able to spot the root cause.
We can try to run the slow test on an NVIDIA GPU with HIP_TARGET=CUDA to see whether it is an issue with the Realm HIP module or with the AMD driver.
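As a sketch of that experiment (the make variables are assumed from Legion's build conventions, and the build/run command is a placeholder):

```shell
# Build the Realm HIP module against the CUDA backend instead of ROCm,
# so the same code path runs on an NVIDIA GPU (assumed Legion make flags).
export USE_HIP=1
export HIP_TARGET=CUDA
echo "USE_HIP=$USE_HIP HIP_TARGET=$HIP_TARGET"
# make -j && ./averageTest.exec ...   (actual build/run command is a placeholder)
```

If the test completes quickly in this configuration, that points at the AMD driver; if it still hangs, the Realm HIP module becomes the prime suspect.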
Given the description of the symptoms and the backtraces above, I suspect what is happening is that you're hitting one of the un-optimized DMA pathways in the Realm HIP module. The Realm CUDA module has had significant work put into it by people at NVIDIA to optimize DMA transfers and push them into CUDA kernels where possible. A DMA transfer that used to do 1M cudaMemcpy calls and take multiple minutes now is turned into a single CUDA kernel that does 1M loads and stores and takes effectively zero time. Optimizations like that have not been done in the HIP module (and cannot be done by anyone on the Realm team at NVIDIA). The suggestion by @eddy16112 will give us a good indication if that is the case.
Let me see if I understand. On AMD GPUs, you have some unit tests that finish in 0.6 seconds, but this particular one (which is similar to at least some of the others) does not complete in 30+ minutes. (And all of the unit tests pass in a short amount of time on NVIDIA hardware.)
That's right.
We can try to run the slow test on NVIDIA GPU with HIP_TARGET=CUDA to see if it is an issue of realm hip module or AMD driver.
Sure, I'll try it
When running HTR++ unit tests on Tioga, some tests freeze without any error message. The freeze is deterministic and happens only when an AMD GPU is utilized. The backtraces of a hanging execution look like the ones shown above.
Do you have any advice on what might be going wrong?
@elliottslaughter, can you please add this issue to #1032?