preda opened this issue 8 months ago
To repro you may check-out: https://github.com/preda/gpuowl/tree/a20d77aa5d06832942d0fec59c690c739a4e7098
(e.g. by cloning the gpuowl project, and checking out the hash above)
Build with "make" in the source dir, then run:
./build-debug/prpll -d 0 -prp 118063003 -verbose
and observe the CPU usage of the process (change "-d 0" to select a different GPU).
Now rebuild after enabling the second queue: in main.cpp, look for #if ENABLE_SECOND_QUEUE and either define that symbol or change the condition to true.
Run the new build (with the second queue enabled) and observe the difference in CPU usage and performance (in my case, OpenCL performance is about 30% slower with the second queue enabled).
Note that the second queue is not actually used at all.
For the repro above, you also need to create a file "work-1.txt" in the work dir, like this:
echo PRP=118845473 > work-1.txt
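For reference, here is a minimal standalone OpenCL sketch (my own illustration, not the gpuowl code) of the pattern being exercised: a second in-order, profiling-enabled command queue that is created but never used.

```cpp
// Minimal standalone sketch (not the gpuowl source) of the trigger: two in-order,
// profiling-enabled command queues on the first GPU; the second queue is created
// but nothing is ever enqueued on it.
#define CL_TARGET_OPENCL_VERSION 200
#include <CL/cl.h>

int main() {
  cl_platform_id platform;
  clGetPlatformIDs(1, &platform, nullptr);

  cl_device_id device;
  clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, nullptr);

  cl_int err;
  cl_context ctx = clCreateContext(nullptr, 1, &device, nullptr, nullptr, &err);

  cl_queue_properties props[] = {CL_QUEUE_PROPERTIES, CL_QUEUE_PROFILING_ENABLE, 0};
  cl_command_queue q1 = clCreateCommandQueueWithProperties(ctx, device, props, &err);
  // Creating this second queue -- never used afterwards -- is what makes one
  // runtime thread spin at 100% CPU in the scenario described above.
  cl_command_queue q2 = clCreateCommandQueueWithProperties(ctx, device, props, &err);

  // ... enqueue all work on q1 only, as gpuowl does ...

  clReleaseCommandQueue(q2);
  clReleaseCommandQueue(q1);
  clReleaseContext(ctx);
  return 0;
}
```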
I may have found a clue. In the "hot" thread (the one that is continuously busy, 100%), at this location:
https://github.com/ROCm/ROCR-Runtime/blob/17b904f609f3e048f7765156c1a9b1ed62cec962/src/core/runtime/interrupt_signal.cpp#L243
the member event_
is always null, which has the effect that the call to
hsaKmtWaitOnEvent_Ext(event_, wait_ms, &event_age);
returns immediately, which makes the loop inside WaitRelaxed() hot.
In contrast, in the other, similar thread event_ isn't null, and the loop waits inside hsaKmtWaitOnEvent_Ext(), so that thread is not hot.
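To illustrate the mechanism, here is a self-contained model of that wait pattern (not the ROCR source; KernelEvent, wait_relaxed, and wait_ms are made-up names for illustration only), showing how a null event_ degrades the blocking wait into a spin.

```cpp
// Standalone model (not ROCR code) of why a missing kernel event turns a
// blocking wait into a busy-wait. `KernelEvent` stands in for HsaEvent; the
// real runtime calls hsaKmtWaitOnEvent_Ext(event_, wait_ms, &event_age).
#include <atomic>
#include <chrono>
#include <condition_variable>
#include <mutex>

struct KernelEvent {                       // stand-in for HsaEvent
  std::mutex m;
  std::condition_variable cv;
  void wait_ms(unsigned ms) {
    std::unique_lock<std::mutex> lk(m);
    cv.wait_for(lk, std::chrono::milliseconds(ms));  // thread sleeps here
  }
};

// Models the shape of InterruptSignal::WaitRelaxed(): poll the signal value,
// and between polls wait on the kernel event if one exists.
long wait_relaxed(std::atomic<long>& value, long compare, KernelEvent* event) {
  while (true) {
    long v = value.load(std::memory_order_relaxed);
    if (v < compare) return v;             // HSA_SIGNAL_CONDITION_LT satisfied
    if (event) {
      event->wait_ms(100);                 // blocks: CPU idle until woken or timeout
    }
    // event == nullptr: fall through immediately -- the loop spins at 100% CPU,
    // which matches what the hot thread shows when event_ is null.
  }
}
```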
So: why does creating a second queue produce this adverse situation?
If confirmed, this seems like a rather serious bug, as it precludes (in practice) the use of a second queue. Combined with the queue not supporting out-of-order execution (as per https://github.com/ROCm/clr/issues/67 ), this does not leave any alternatives.
In the good thread:
(gdb) bt
#0 rocr::core::InterruptSignal::WaitRelaxed (this=0x7ff9e0020cc0, condition=HSA_SIGNAL_CONDITION_LT, compare_value=1, timeout=18446744073709551615, wait_hint=HSA_WAIT_STATE_BLOCKED) at /home/preda/ROCR-Runtime/src/core/runtime/interrupt_signal.cpp:243
#1 0x00007fffee8af824 in rocr::core::InterruptSignal::WaitAcquire (this=0x7ff9e0020cc0, condition=HSA_SIGNAL_CONDITION_LT, compare_value=1, timeout=18446744073709551615, wait_hint=HSA_WAIT_STATE_BLOCKED) at /home/preda/ROCR-Runtime/src/core/runtime/interrupt_signal.cpp:251
#2 0x00007fffee8976e9 in rocr::HSA::hsa_signal_wait_scacquire (hsa_signal=..., condition=HSA_SIGNAL_CONDITION_LT, compare_value=1, timeout_hint=18446744073709551615, wait_state_hint=HSA_WAIT_STATE_BLOCKED) at /home/preda/ROCR-Runtime/src/core/runtime/hsa.cpp:1220
#3 0x00007fffee907c2d in hsa_signal_wait_scacquire (signal=..., condition=HSA_SIGNAL_CONDITION_LT, compare_value=1, timeout_hint=18446744073709551615, wait_expectancy_hint=HSA_WAIT_STATE_BLOCKED) at /home/preda/ROCR-Runtime/src/core/common/hsa_table_interface.cpp:341
#4 0x00007ffff7eea4b4 in ?? () from /opt/rocm/lib/libamdocl64.so
#5 0x00007ffff7eeba5e in ?? () from /opt/rocm/lib/libamdocl64.so
#6 0x00007ffff7eeff58 in ?? () from /opt/rocm/lib/libamdocl64.so
#7 0x00007ffff7eb5c93 in ?? () from /opt/rocm/lib/libamdocl64.so
#8 0x00007ffff7eb713f in ?? () from /opt/rocm/lib/libamdocl64.so
#9 0x00007ffff7e4d5b0 in ?? () from /opt/rocm/lib/libamdocl64.so
#10 0x00007ffff7eaa157 in ?? () from /opt/rocm/lib/libamdocl64.so
#11 0x00007ffff7694ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#12 0x00007ffff7726850 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
(gdb) p this->event_
$27 = (HsaEvent *) 0x7ff9e0020d10
(gdb) p *this->event_
$28 = {EventId = 929, EventData = {EventType = HSA_EVENTTYPE_SIGNAL, EventData = {SyncVar = {SyncVar = {UserData = 0x0, UserDataPtrValue = 0}, SyncVarSize = 8}, NodeChangeState = {Flags = HSA_EVENTTYPE_NODECHANGE_ADD}, DeviceState = {NodeId = 0, Device = HSA_DEVICE_CPU, Flags = 8}, MemoryAccessFault = {NodeId = 0, VirtualAddress = 34359738368, Failure = {NotPresent = 0,
ReadOnly = 0, NoExecute = 0, GpuAccess = 0, ECC = 0, Imprecise = 0, ErrorType = 0, Reserved = 0}, Flags = HSA_EVENTID_MEMORY_RECOVERABLE}, HwException = {NodeId = 0, ResetType = 0, MemoryLost = 8, ResetCause = HSA_EVENTID_HW_EXCEPTION_GPU_HANG}}, HWData1 = 929, HWData2 = 140737353751816, HWData3 = 929}}
And in the hot thread:
(gdb) bt
#0 rocr::core::InterruptSignal::WaitRelaxed (this=0x7ff9d00175d0, condition=HSA_SIGNAL_CONDITION_LT, compare_value=1, timeout=18446744073709551615, wait_hint=HSA_WAIT_STATE_BLOCKED) at /home/preda/ROCR-Runtime/src/core/runtime/interrupt_signal.cpp:243
#1 0x00007fffee8af824 in rocr::core::InterruptSignal::WaitAcquire (this=0x7ff9d00175d0, condition=HSA_SIGNAL_CONDITION_LT, compare_value=1, timeout=18446744073709551615, wait_hint=HSA_WAIT_STATE_BLOCKED) at /home/preda/ROCR-Runtime/src/core/runtime/interrupt_signal.cpp:251
#2 0x00007fffee8976e9 in rocr::HSA::hsa_signal_wait_scacquire (hsa_signal=..., condition=HSA_SIGNAL_CONDITION_LT, compare_value=1, timeout_hint=18446744073709551615, wait_state_hint=HSA_WAIT_STATE_BLOCKED) at /home/preda/ROCR-Runtime/src/core/runtime/hsa.cpp:1220
#3 0x00007fffee907c2d in hsa_signal_wait_scacquire (signal=..., condition=HSA_SIGNAL_CONDITION_LT, compare_value=1, timeout_hint=18446744073709551615, wait_expectancy_hint=HSA_WAIT_STATE_BLOCKED) at /home/preda/ROCR-Runtime/src/core/common/hsa_table_interface.cpp:341
#4 0x00007ffff7eea4b4 in ?? () from /opt/rocm/lib/libamdocl64.so
#5 0x00007ffff7eeba5e in ?? () from /opt/rocm/lib/libamdocl64.so
#6 0x00007ffff7eeff58 in ?? () from /opt/rocm/lib/libamdocl64.so
#7 0x00007ffff7eb5c93 in ?? () from /opt/rocm/lib/libamdocl64.so
#8 0x00007ffff7eb713f in ?? () from /opt/rocm/lib/libamdocl64.so
#9 0x00007ffff7e4d5b0 in ?? () from /opt/rocm/lib/libamdocl64.so
#10 0x00007ffff7eaa157 in ?? () from /opt/rocm/lib/libamdocl64.so
#11 0x00007ffff7694ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#12 0x00007ffff7726850 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
(gdb) p this->event_
$29 = (HsaEvent *) 0x0
Thank you for reporting this hot-thread issue with second command queue handling in ROCm, and for your detailed analysis, preda. We've assigned this issue for immediate triage.
@preda - As you have pointed out here: https://github.com/preda/clr/commit/03858366ea2dc96d504a16f7002e1781560e2b8a, it's true that the kernel driver can only support a maximum of 4096 interrupt signals per device. The failure to allocate any further interrupt events leads to this hot loop. This issue is going to be addressed by https://github.com/ROCm/clr/pull/71#issuecomment-2018811810
@shwetagkhatri while what exposed the problem was indeed CLR allocating too many events to its per-queue pool (and the fix there is welcome), the question remains: what is the right behavior when the supply of kernel interrupt signals is exhausted? That situation may still happen even after their fix.
For what concerns this issue, feel free to close it as soon as the fix makes it into a public release.
In fact, I think the problem with the "hot queue" is not fully fixed by CLR reducing its pool size; I found a different scenario that reproduces it even after the CLR fix. Here it is:
At this point we have the hot thread again, on the second queue. Let me explain why: when many kernels are launched with completion events, InterruptSignal::EventPool::alloc() allocates all available HW events. What's more, when these events are released by the client, they are not released back to the kernel but are kept in the EventPool's cache ("events_"). So even after the client releases them, by the time Q2 is created there are no HW events available, and the thread becomes hot as in the earlier scenario.
So the problem is not just the CLR pool: there are other ways to exhaust the kernel HW events, and the Queue can't function acceptably once that happens.
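To make the caching behavior concrete, here is an illustrative model (not the actual ROCR EventPool; the names are mine) of a pool whose freed events go back into a user-space cache rather than to the kernel, so a queue created later finds the kernel's interrupt-event budget already spent.

```cpp
// Illustrative model of the caching behavior described above: freed events are
// kept in a user-space cache, never returned to the kernel, so the kernel's
// interrupt-event budget stays consumed.
#include <cstddef>
#include <vector>

struct HwEvent {};                         // stand-in for a kernel interrupt event

class EventPoolModel {
  std::vector<HwEvent*> cache_;            // corresponds to EventPool::events_
  std::size_t kernel_budget_;              // e.g. 4096 interrupt signals per device
public:
  explicit EventPoolModel(std::size_t budget) : kernel_budget_(budget) {}

  HwEvent* alloc() {
    if (!cache_.empty()) {                 // reuse a cached event first
      HwEvent* e = cache_.back();
      cache_.pop_back();
      return e;
    }
    if (kernel_budget_ == 0) return nullptr;  // kernel exhausted -> null event_
    --kernel_budget_;                         // "create" a new kernel event
    return new HwEvent{};
  }

  void free(HwEvent* e) {
    cache_.push_back(e);                   // cached, NOT returned to the kernel
  }
};
```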
I think there are two things to do:
Yes, with enough signal usage, or when profiling, we may exhaust the number of interrupt signals, because the runtime only creates interrupt signals. I have a change in mind to mitigate this, but it's a bit more intricate. The idea is to only use interrupt signals for the cases where we need to wait; that's usually the last kernel/barrier in the batch, e.g. k0, k1, b1, k2, copy1, k3, ....kN, bN. Only the last kernel or bN which we queue needs to have an interrupt signal, because that's what the async handler thread will wait on. Copies or the rest of the kernels in the batch don't need interrupt signals.
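A rough sketch of that idea (hypothetical helper names, not the actual runtime API): only the final dispatch in a batch, the one the async handler thread actually waits on, gets a scarce interrupt signal, while intermediate kernels and copies get plain user-mode signals.

```cpp
// Sketch of the proposed mitigation with hypothetical names: reserve interrupt
// signals for the last packet of each batch; everything else uses user-mode
// (memory-based) signals that need no kernel HW event.
#include <cstddef>
#include <vector>

struct Signal {};
struct Dispatch {};  // stands in for a kernel, barrier, or copy packet

// Hypothetical factories standing in for the two kinds of completion signals.
Signal* createInterruptSignal() { return new Signal{}; }  // stub: real one is backed by a kernel HW event (limited, ~4096/device)
Signal* createUserModeSignal()  { return new Signal{}; }  // stub: real one is memory-based, no kernel event needed

void submitBatch(const std::vector<Dispatch>& batch) {
  for (std::size_t i = 0; i < batch.size(); ++i) {
    bool isLast = (i + 1 == batch.size());
    // Only the last packet (e.g. the final barrier bN) gets an interrupt
    // signal, because that is the only completion the async handler thread
    // blocks on; intermediate kernels and copies get user-mode signals.
    Signal* completion = isLast ? createInterruptSignal() : createUserModeSignal();
    // ... attach `completion` to batch[i] and enqueue it ...
    (void)completion;
  }
}
```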
On Ubuntu 22.04, kernel 6.7.9, ROCm 6.1.0 (RC), Radeon Pro VII.
In brief: when a second command queue is created (even though it is never used), one thread starts eating 100% CPU, i.e. busy-waiting. The performance of the other command queue is impacted as well.
Below, I compare the "normal" situation observed when using a single command queue vs. what is observed when a second command queue is created (hot loop).
Using a plain single-threaded OpenCL app with one host command queue (in-order, profiling enabled), accessed only from the main thread, this is the thread layout that I see ("the normal"):
When adding a second command queue (but without using it at all), the thread layout becomes:
The problem is created by the last thread above (6), which is eating 100% CPU, caught up in a hot loop. To see a few other points in this loop:
All the command queues are in-order, with profiling enabled. The situation is pretty severe, basically precluding the use of more than one command queue.