preda opened this issue 8 months ago
To repro you may check-out: https://github.com/preda/gpuowl/tree/a20d77aa5d06832942d0fec59c690c739a4e7098
(e.g. by cloning the gpuowl project, and checking out the hash above)
Build with "make" in the source dir, then run:
./build-debug/prpll -d 0 -prp 118063003 -verbose
and observe the CPU usage of the process (change "-d 0" to select a different GPU).
Now rebuild after enabling the second queue: in main.cpp, look for #if ENABLE_SECOND_QUEUE and either define that symbol or change the condition to true.
Run the new build (with the second queue enabled) and observe the difference in CPU usage and performance (in my case, OpenCL performance is about 30% slower with the second queue enabled).
Note that the second queue is not actually used at all.
For the repro above, you also need to create a file "work-1.txt" in the work dir, like this:
echo PRP=118845473 > work-1.txt
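For reference, here is a minimal standalone OpenCL sketch (my own illustration, not the gpuowl code) of the pattern being exercised: a second in-order, profiling-enabled command queue that is created but never used.

```cpp
// Minimal standalone sketch (not the gpuowl source) of the trigger: two in-order,
// profiling-enabled command queues on the first GPU; the second queue is created
// but nothing is ever enqueued on it.
#define CL_TARGET_OPENCL_VERSION 200
#include <CL/cl.h>

int main() {
  cl_platform_id platform;
  clGetPlatformIDs(1, &platform, nullptr);

  cl_device_id device;
  clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, nullptr);

  cl_int err;
  cl_context ctx = clCreateContext(nullptr, 1, &device, nullptr, nullptr, &err);

  cl_queue_properties props[] = {CL_QUEUE_PROPERTIES, CL_QUEUE_PROFILING_ENABLE, 0};
  cl_command_queue q1 = clCreateCommandQueueWithProperties(ctx, device, props, &err);
  // Creating this second queue -- never used afterwards -- is what makes one
  // runtime thread spin at 100% CPU in the scenario described above.
  cl_command_queue q2 = clCreateCommandQueueWithProperties(ctx, device, props, &err);

  // ... enqueue all work on q1 only, as gpuowl does ...

  clReleaseCommandQueue(q2);
  clReleaseCommandQueue(q1);
  clReleaseContext(ctx);
  return 0;
}
```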
I may have found a clue. In the "hot" thread (the one that is continuously busy, 100%), at this location:
https://github.com/ROCm/ROCR-Runtime/blob/17b904f609f3e048f7765156c1a9b1ed62cec962/src/core/runtime/interrupt_signal.cpp#L243
the member event_
is always null, which has the effect that the call to
hsaKmtWaitOnEvent_Ext(event_, wait_ms, &event_age);
returns immediately, which makes the loop inside WaitRelaxed() hot.
In contrast, in the other, similar thread event_ isn't null, and the loop waits inside hsaKmtWaitOnEvent_Ext(), so that thread is not hot.
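To illustrate the mechanism, here is a self-contained model of that wait pattern (not the ROCR source; KernelEvent, wait_relaxed, and wait_ms are made-up names for illustration only), showing how a null event_ degrades the blocking wait into a spin.

```cpp
// Standalone model (not ROCR code) of why a missing kernel event turns a
// blocking wait into a busy-wait. `KernelEvent` stands in for HsaEvent; the
// real runtime calls hsaKmtWaitOnEvent_Ext(event_, wait_ms, &event_age).
#include <atomic>
#include <chrono>
#include <condition_variable>
#include <mutex>

struct KernelEvent {                       // stand-in for HsaEvent
  std::mutex m;
  std::condition_variable cv;
  void wait_ms(unsigned ms) {
    std::unique_lock<std::mutex> lk(m);
    cv.wait_for(lk, std::chrono::milliseconds(ms));  // thread sleeps here
  }
};

// Models the shape of InterruptSignal::WaitRelaxed(): poll the signal value,
// and between polls wait on the kernel event if one exists.
long wait_relaxed(std::atomic<long>& value, long compare, KernelEvent* event) {
  while (true) {
    long v = value.load(std::memory_order_relaxed);
    if (v < compare) return v;             // HSA_SIGNAL_CONDITION_LT satisfied
    if (event) {
      event->wait_ms(100);                 // blocks: CPU idle until woken or timeout
    }
    // event == nullptr: fall through immediately -- the loop spins at 100% CPU,
    // which matches what the hot thread shows when event_ is null.
  }
}
```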
So: why does creating a second queue produce this adverse situation?
If confirmed, this seems like a rather serious bug, as it precludes (in practice) the use of a second queue. Combined with the queue not supporting out-of-order execution (as per https://github.com/ROCm/clr/issues/67 ), this does not leave any alternatives.
In the good thread:
(gdb) bt
#0 rocr::core::InterruptSignal::WaitRelaxed (this=0x7ff9e0020cc0, condition=HSA_SIGNAL_CONDITION_LT, compare_value=1, timeout=18446744073709551615, wait_hint=HSA_WAIT_STATE_BLOCKED) at /home/preda/ROCR-Runtime/src/core/runtime/interrupt_signal.cpp:243
#1 0x00007fffee8af824 in rocr::core::InterruptSignal::WaitAcquire (this=0x7ff9e0020cc0, condition=HSA_SIGNAL_CONDITION_LT, compare_value=1, timeout=18446744073709551615, wait_hint=HSA_WAIT_STATE_BLOCKED) at /home/preda/ROCR-Runtime/src/core/runtime/interrupt_signal.cpp:251
#2 0x00007fffee8976e9 in rocr::HSA::hsa_signal_wait_scacquire (hsa_signal=..., condition=HSA_SIGNAL_CONDITION_LT, compare_value=1, timeout_hint=18446744073709551615, wait_state_hint=HSA_WAIT_STATE_BLOCKED) at /home/preda/ROCR-Runtime/src/core/runtime/hsa.cpp:1220
#3 0x00007fffee907c2d in hsa_signal_wait_scacquire (signal=..., condition=HSA_SIGNAL_CONDITION_LT, compare_value=1, timeout_hint=18446744073709551615, wait_expectancy_hint=HSA_WAIT_STATE_BLOCKED) at /home/preda/ROCR-Runtime/src/core/common/hsa_table_interface.cpp:341
#4 0x00007ffff7eea4b4 in ?? () from /opt/rocm/lib/libamdocl64.so
#5 0x00007ffff7eeba5e in ?? () from /opt/rocm/lib/libamdocl64.so
#6 0x00007ffff7eeff58 in ?? () from /opt/rocm/lib/libamdocl64.so
#7 0x00007ffff7eb5c93 in ?? () from /opt/rocm/lib/libamdocl64.so
#8 0x00007ffff7eb713f in ?? () from /opt/rocm/lib/libamdocl64.so
#9 0x00007ffff7e4d5b0 in ?? () from /opt/rocm/lib/libamdocl64.so
#10 0x00007ffff7eaa157 in ?? () from /opt/rocm/lib/libamdocl64.so
#11 0x00007ffff7694ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#12 0x00007ffff7726850 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
(gdb) p this->event_
$27 = (HsaEvent *) 0x7ff9e0020d10
(gdb) p *this->event_
$28 = {EventId = 929, EventData = {EventType = HSA_EVENTTYPE_SIGNAL, EventData = {SyncVar = {SyncVar = {UserData = 0x0, UserDataPtrValue = 0}, SyncVarSize = 8}, NodeChangeState = {Flags = HSA_EVENTTYPE_NODECHANGE_ADD}, DeviceState = {NodeId = 0, Device = HSA_DEVICE_CPU, Flags = 8}, MemoryAccessFault = {NodeId = 0, VirtualAddress = 34359738368, Failure = {NotPresent = 0,
ReadOnly = 0, NoExecute = 0, GpuAccess = 0, ECC = 0, Imprecise = 0, ErrorType = 0, Reserved = 0}, Flags = HSA_EVENTID_MEMORY_RECOVERABLE}, HwException = {NodeId = 0, ResetType = 0, MemoryLost = 8, ResetCause = HSA_EVENTID_HW_EXCEPTION_GPU_HANG}}, HWData1 = 929, HWData2 = 140737353751816, HWData3 = 929}}
And in the hot thread:
(gdb) bt
#0 rocr::core::InterruptSignal::WaitRelaxed (this=0x7ff9d00175d0, condition=HSA_SIGNAL_CONDITION_LT, compare_value=1, timeout=18446744073709551615, wait_hint=HSA_WAIT_STATE_BLOCKED) at /home/preda/ROCR-Runtime/src/core/runtime/interrupt_signal.cpp:243
#1 0x00007fffee8af824 in rocr::core::InterruptSignal::WaitAcquire (this=0x7ff9d00175d0, condition=HSA_SIGNAL_CONDITION_LT, compare_value=1, timeout=18446744073709551615, wait_hint=HSA_WAIT_STATE_BLOCKED) at /home/preda/ROCR-Runtime/src/core/runtime/interrupt_signal.cpp:251
#2 0x00007fffee8976e9 in rocr::HSA::hsa_signal_wait_scacquire (hsa_signal=..., condition=HSA_SIGNAL_CONDITION_LT, compare_value=1, timeout_hint=18446744073709551615, wait_state_hint=HSA_WAIT_STATE_BLOCKED) at /home/preda/ROCR-Runtime/src/core/runtime/hsa.cpp:1220
#3 0x00007fffee907c2d in hsa_signal_wait_scacquire (signal=..., condition=HSA_SIGNAL_CONDITION_LT, compare_value=1, timeout_hint=18446744073709551615, wait_expectancy_hint=HSA_WAIT_STATE_BLOCKED) at /home/preda/ROCR-Runtime/src/core/common/hsa_table_interface.cpp:341
#4 0x00007ffff7eea4b4 in ?? () from /opt/rocm/lib/libamdocl64.so
#5 0x00007ffff7eeba5e in ?? () from /opt/rocm/lib/libamdocl64.so
#6 0x00007ffff7eeff58 in ?? () from /opt/rocm/lib/libamdocl64.so
#7 0x00007ffff7eb5c93 in ?? () from /opt/rocm/lib/libamdocl64.so
#8 0x00007ffff7eb713f in ?? () from /opt/rocm/lib/libamdocl64.so
#9 0x00007ffff7e4d5b0 in ?? () from /opt/rocm/lib/libamdocl64.so
#10 0x00007ffff7eaa157 in ?? () from /opt/rocm/lib/libamdocl64.so
#11 0x00007ffff7694ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#12 0x00007ffff7726850 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
(gdb) p this->event_
$29 = (HsaEvent *) 0x0
Thank you for reporting this hot-thread issue with second command queue handling in ROCm, and for your detailed analysis, preda. We've assigned this issue for immediate triage.
@preda - As you have pointed out here: https://github.com/preda/clr/commit/03858366ea2dc96d504a16f7002e1781560e2b8a, it's true that the kernel driver can only support a maximum of 4096 interrupt signals per device. The failure to allocate any further interrupt events leads to this hot loop. This issue is going to be addressed by https://github.com/ROCm/clr/pull/71#issuecomment-2018811810
@shwetagkhatri while what exposed the problem was indeed CLR allocating too many events to its per-queue pool (and the fix there is welcome), the question remains: what is the right behavior when the supply of kernel interrupt signals is exhausted? That situation may still happen even after their fix.
For what concerns this issue, feel free to close it as soon as the fix makes it into a public release.
In fact, I think the problem with the "hot queue" is not fully fixed by CLR reducing its pool size; I found a different scenario that reproduces it even after the CLR fix. Here it is:
At this point we have the hot thread again, on the second queue. Let me explain why: when many kernels are launched with completion events, InterruptSignal::EventPool::alloc() allocates all available HW events. What's more, when these events are released by the client, they are not released back to the kernel but are kept in the EventPool's cache ("events_"). So even after the client releases them, by the time Q2 is created there are no HW events available, and the thread becomes hot as in the earlier scenario.
So the problem is not just the CLR pool: there are other ways to exhaust the kernel HW events, and the Queue can't function acceptably once that happens.
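To make the caching behavior concrete, here is an illustrative model (not the actual ROCR EventPool; the names are mine) of a pool whose freed events go back into a user-space cache rather than to the kernel, so a queue created later finds the kernel's interrupt-event budget already spent.

```cpp
// Illustrative model of the caching behavior described above: freed events are
// kept in a user-space cache, never returned to the kernel, so the kernel's
// interrupt-event budget stays consumed.
#include <cstddef>
#include <vector>

struct HwEvent {};                         // stand-in for a kernel interrupt event

class EventPoolModel {
  std::vector<HwEvent*> cache_;            // corresponds to EventPool::events_
  std::size_t kernel_budget_;              // e.g. 4096 interrupt signals per device
public:
  explicit EventPoolModel(std::size_t budget) : kernel_budget_(budget) {}

  HwEvent* alloc() {
    if (!cache_.empty()) {                 // reuse a cached event first
      HwEvent* e = cache_.back();
      cache_.pop_back();
      return e;
    }
    if (kernel_budget_ == 0) return nullptr;  // kernel exhausted -> null event_
    --kernel_budget_;                         // "create" a new kernel event
    return new HwEvent{};
  }

  void free(HwEvent* e) {
    cache_.push_back(e);                   // cached, NOT returned to the kernel
  }
};
```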
I think there are two things to do:
Yes, with enough signal usage, or when profiling, we may exhaust the number of interrupt signals, because the runtime only creates interrupt signals. I have a change in mind to mitigate this, but it's a bit more intricate. The idea is to only use interrupt signals for the cases where we need to wait; that's usually the last kernel/barrier in the batch, e.g. k0, k1, b1, k2, copy1, k3, ....kN, bN. Only the last kernel or bN which we queue needs to have an interrupt signal, because that's what the async handler thread will wait on. Copies or the rest of the kernels in the batch don't need interrupt signals.
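A rough sketch of that idea (hypothetical helper names, not the actual runtime API): only the final dispatch in a batch, the one the async handler thread actually waits on, gets a scarce interrupt signal, while intermediate kernels and copies get plain user-mode signals.

```cpp
// Sketch of the proposed mitigation with hypothetical names: reserve interrupt
// signals for the last packet of each batch; everything else uses user-mode
// (memory-based) signals that need no kernel HW event.
#include <cstddef>
#include <vector>

struct Signal {};
struct Dispatch {};  // stands in for a kernel, barrier, or copy packet

// Hypothetical factories standing in for the two kinds of completion signals.
Signal* createInterruptSignal() { return new Signal{}; }  // stub: real one is backed by a kernel HW event (limited, ~4096/device)
Signal* createUserModeSignal()  { return new Signal{}; }  // stub: real one is memory-based, no kernel event needed

void submitBatch(const std::vector<Dispatch>& batch) {
  for (std::size_t i = 0; i < batch.size(); ++i) {
    bool isLast = (i + 1 == batch.size());
    // Only the last packet (e.g. the final barrier bN) gets an interrupt
    // signal, because that is the only completion the async handler thread
    // blocks on; intermediate kernels and copies get user-mode signals.
    Signal* completion = isLast ? createInterruptSignal() : createUserModeSignal();
    // ... attach `completion` to batch[i] and enqueue it ...
    (void)completion;
  }
}
```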
On Ubuntu 22.04, kernel 6.7.9, ROCm 6.1.0 (RC), Radeon Pro VII.
In brief: when a second command queue is created (even though it is never used), one thread starts eating 100% CPU, i.e. busy-waiting. The performance of the other command queue is impacted as well.
Below, I compare the "normal" situation observed when using a single command queue vs. what is observed when a second command queue is created (hot loop).
Using a plain single-threaded OpenCL app with one host command queue (in-order, profiling enabled), accessed only from the main thread, this is the thread layout that I see ("the normal"):
When adding a second command queue (but without using it at all), the thread layout becomes:
The problem is created by the last thread above (6), which is eating 100% CPU, caught up in a hot loop. To see a few other points in this loop:
All the command queues are in-order, with profiling enabled. The situation is pretty severe, basically precluding the use of more than one command queue.