ROCm / ROCR-Runtime

ROCm Platform Runtime: ROCr a HPC market enhanced HSA based runtime
https://rocm.docs.amd.com/projects/ROCR-Runtime/en/latest/

core::Signal::WaitAny never gets the signal #96

Closed Kelvin-Ng closed 2 weeks ago

Kelvin-Ng commented 4 years ago

I am trying to run OpenCL applications on the APU of a Ryzen 3400G (i.e., gfx902+xnack). Sometimes they run and return correct results, but sometimes they block forever. Tracing with gdb shows that the main thread is blocked in clCreateCommandQueue(), waiting for events:

#1  do_futex_wait (sem=sem@entry=0x627148, abstime=0x0) at sem_waitcommon.c:111
#2  0x00007ffff6f6c8d4 in __new_sem_wait_slow (sem=0x627148, abstime=0x0) at sem_waitcommon.c:181
#3  0x00007ffff6f6c97a in __new_sem_wait (sem=<optimized out>) at sem_wait.c:29
#4  0x00007ffff688df60 in amd::Semaphore::wait() () from /opt/rocm/lib/../opencl/lib/libamdocl64.so
#5  0x00007ffff688dd4a in amd::Monitor::wait() () from /opt/rocm/lib/../opencl/lib/libamdocl64.so
#6  0x00007ffff68a3bb4 in amd::HostQueue::HostQueue(amd::Context&, amd::Device&, unsigned long, unsigned int, amd::CommandQueue::Priority) ()
   from /opt/rocm/lib/../opencl/lib/libamdocl64.so
#7  0x00007ffff687fa56 in clCreateCommandQueueWithProperties () from /opt/rocm/lib/../opencl/lib/libamdocl64.so
#8  0x00007ffff687fd28 in clCreateCommandQueue () from /opt/rocm/lib/../opencl/lib/libamdocl64.so
#9  0x0000000000401193 in main ()

It is waiting for an update from another thread, thread 2 in gdb, whose backtrace looks like this:

#0  0x00007ffff7276f47 in ioctl () at ../sysdeps/unix/syscall-template.S:84
#1  0x00007ffff6330654 in kmtIoctl () from /usr/local/lib/libhsakmt.so.1
#2  0x00007ffff632a84f in hsaKmtWaitOnMultipleEvents () from /usr/local/lib/libhsakmt.so.1
#3  0x00007ffff65aa2a3 in core::Signal::WaitAny(unsigned int, hsa_signal_s const*, hsa_signal_condition_t const*, long const*, unsigned long, hsa_wait_state_t, long*) ()
   from /opt/rocm/lib/../opencl/lib/../../lib/libhsa-runtime64.so.1
#4  0x00007ffff6591516 in AMD::hsa_amd_signal_wait_any(unsigned int, hsa_signal_s*, hsa_signal_condition_t*, long*, unsigned long, hsa_wait_state_t, long*) ()
   from /opt/rocm/lib/../opencl/lib/../../lib/libhsa-runtime64.so.1
#5  0x00007ffff65a256a in core::Runtime::AsyncEventsLoop(void*) () from /opt/rocm/lib/../opencl/lib/../../lib/libhsa-runtime64.so.1
#6  0x00007ffff6561217 in os::ThreadTrampoline(void*) () from /opt/rocm/lib/../opencl/lib/../../lib/libhsa-runtime64.so.1
#7  0x00007ffff6f646ba in start_thread (arg=0x7ffff5adf700) at pthread_create.c:333
#8  0x00007ffff728141d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
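Backtraces like the two above can be collected from an already-hung process in one step by running gdb in batch mode. The following is a minimal sketch, not part of the original report: the helper names `gdb_backtrace_argv` and `dump_backtraces` are hypothetical, and it assumes gdb is installed and that the caller is allowed to attach to the target process (this may require root or a relaxed ptrace policy).

```python
import subprocess


def gdb_backtrace_argv(pid):
    """Build a gdb command line that attaches to `pid`, prints a
    backtrace for every thread, and then exits (batch mode)."""
    return ["gdb", "-p", str(int(pid)), "-batch", "-ex", "thread apply all bt"]


def dump_backtraces(pid, timeout_s=60.0):
    """Run gdb against a live process and return its stdout as text.

    Assumes gdb is on PATH; raises FileNotFoundError otherwise.
    """
    result = subprocess.run(
        gdb_backtrace_argv(pid),
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )
    return result.stdout
```

Calling `dump_backtraces(pid)` with the PID of the blocked application should print the main thread and the AsyncEventsLoop thread together, which makes it easier to compare runs.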

If I run with HSA_ENABLE_INTERRUPT=0, I get the following instead:

#0  0x00007ffff66e6232 in core::Signal::WaitAny(unsigned int, hsa_signal_s const*, hsa_signal_condition_t const*, long const*, unsigned long, hsa_wait_state_t, long*) ()
   from /opt/rocm/lib/libhsa-runtime64.so.1
#1  0x00007ffff66cd516 in AMD::hsa_amd_signal_wait_any(unsigned int, hsa_signal_s*, hsa_signal_condition_t*, long*, unsigned long, hsa_wait_state_t, long*) ()
   from /opt/rocm/lib/libhsa-runtime64.so.1
#2  0x00007ffff66de56a in core::Runtime::AsyncEventsLoop(void*) () from /opt/rocm/lib/libhsa-runtime64.so.1
#3  0x00007ffff669d217 in os::ThreadTrampoline(void*) () from /opt/rocm/lib/libhsa-runtime64.so.1
#4  0x00007ffff79ab6ba in start_thread (arg=0x7ffeef514700) at pthread_create.c:333
#5  0x00007ffff6c5841d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

Adding printf statements shows that the program is stuck in an infinite loop because it never receives the signal.
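To illustrate why a missing signal shows up as an infinite loop when interrupts are disabled, here is a simplified model of a polling wait over several signals. It is only a sketch under stated assumptions, not the ROCr implementation: the function `wait_any`, its parameters, and the representation of signals as plain callables are all hypothetical.

```python
import time


def wait_any(signals, expected, timeout_s=None, poll_interval_s=0.001):
    """Poll until some signals[i]() equals expected[i].

    Returns the index of the first satisfied signal, or None if
    `timeout_s` elapses first. With timeout_s=None the loop only
    ends when a signal is satisfied, so it spins forever if no
    signal is ever delivered.
    """
    deadline = None if timeout_s is None else time.monotonic() + timeout_s
    while True:
        # Check every signal once per polling round.
        for i, (read, want) in enumerate(zip(signals, expected)):
            if read() == want:
                return i
        if deadline is not None and time.monotonic() >= deadline:
            return None
        time.sleep(poll_interval_s)
```

In this model, a caller that passes no timeout and whose signals are never updated stays in the `while` loop indefinitely, which matches the symptom seen in thread 2.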

Originally I thought this was a ROCT problem, so I posted an issue there: https://github.com/RadeonOpenCompute/ROCT-Thunk-Interface/issues/56. Following the suggestion given there, I ran with HSA_ENABLE_SDMA=0, but it made no difference.

I am quite sure that this is not directly related to OpenCL. I also tried running HIP applications (compiled with an unofficial hipcc that I hacked together), and a similar problem occurs. The thread launching kernels gets blocked here:

#0  0x00007ffff6c3b827 in sched_yield () at ../sysdeps/unix/syscall-template.S:84
#1  0x00007ffff66b2c65 in amd::AqlQueue::ExecutePM4(unsigned int*, unsigned long) () from /opt/rocm/lib/libhsa-runtime64.so.1
#2  0x00007ffff66a87e8 in amd::GpuAgent::InvalidateCodeCaches() () from /opt/rocm/lib/libhsa-runtime64.so.1
#3  0x00007ffff66b50be in amd::LoaderContext::SegmentAlloc(amdgpu_hsa_elf_segment_t, hsa_agent_s, unsigned long, unsigned long, bool) ()
   from /opt/rocm/lib/libhsa-runtime64.so.1
#4  0x00007ffff66eed15 in amd::hsa::loader::ExecutableImpl::LoadSegmentsV2(hsa_agent_s, amd::hsa::code::AmdHsaCode const*) () from /opt/rocm/lib/libhsa-runtime64.so.1
#5  0x00007ffff66ef40e in amd::hsa::loader::ExecutableImpl::LoadSegments(hsa_agent_s, amd::hsa::code::AmdHsaCode const*, unsigned int) ()
   from /opt/rocm/lib/libhsa-runtime64.so.1
#6  0x00007ffff66f24f3 in amd::hsa::loader::ExecutableImpl::LoadCodeObject(hsa_agent_s, hsa_code_object_s, unsigned long, char const*, hsa_loaded_code_object_s*) ()
   from /opt/rocm/lib/libhsa-runtime64.so.1
#7  0x00007ffff66c73e7 in HSA::hsa_executable_load_agent_code_object(hsa_executable_s, hsa_agent_s, hsa_code_object_reader_s, char const*, hsa_loaded_code_object_s*) ()
   from /opt/rocm/lib/libhsa-runtime64.so.1
#8  0x00007ffff742f845 in roc::LightningProgram::setKernels(amd::option::Options*, void*, unsigned long) () from /opt/rocm-3.5.1/hip/lib/libamdhip64.so.3
#9  0x00007ffff73d8d45 in device::Program::linkImplLC(amd::option::Options*) () from /opt/rocm-3.5.1/hip/lib/libamdhip64.so.3
#10 0x00007ffff73d9975 in device::Program::build(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, char const*, amd::option::Options*) () from /opt/rocm-3.5.1/hip/lib/libamdhip64.so.3
#11 0x00007ffff73ec5fc in amd::Program::build(std::vector<amd::Device*, std::allocator<amd::Device*> > const&, char const*, void (*)(_cl_program*, void*), void*, bool) ()
   from /opt/rocm-3.5.1/hip/lib/libamdhip64.so.3
#12 0x00007ffff7377e15 in PlatformState::getFunc(void const*, int) () from /opt/rocm-3.5.1/hip/lib/libamdhip64.so.3
#13 0x00007ffff7380653 in hipLaunchKernel () from /opt/rocm-3.5.1/hip/lib/libamdhip64.so.3
#14 0x0000000000400f8e in main ()

Thread 2 in gdb shows the same trace as in the OpenCL case.

When I run HIP applications, they sometimes get past this point but then block at hipDeviceSynchronize() instead. Thread 2 shows exactly the same trace. However, if the first hipDeviceSynchronize() returns, all subsequent hipDeviceSynchronize() calls return as well.

To summarize:

  1. Thread 2 is waiting for a signal but never receives it.
  2. The main thread is blocked only because it never gets an update from thread 2.
  3. The same problem occurs with both OpenCL and HIP, during command queue creation, kernel launch, or device synchronization.
  4. The problem is intermittent. Sometimes the program (both OpenCL and HIP) finishes and returns correct results.
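Because the hang is intermittent, it can help to measure how often it occurs under each environment setting mentioned above (HSA_ENABLE_INTERRUPT=0, HSA_ENABLE_SDMA=0). The following is a minimal sketch of such a harness, assuming a run counts as hung when it exceeds a timeout; the function name `count_hangs`, the default run count, and the default timeout are all hypothetical choices.

```python
import os
import subprocess
import sys


def count_hangs(cmd, runs=20, timeout_s=30.0, env_overrides=None):
    """Run `cmd` repeatedly and return (completed, hung).

    A run is counted as hung when it exceeds `timeout_s`;
    subprocess.run kills the child in that case. Runs that exit
    with a nonzero status still count as completed.
    """
    env = dict(os.environ)
    env.update(env_overrides or {})
    completed = hung = 0
    for _ in range(runs):
        try:
            subprocess.run(
                cmd,
                env=env,
                timeout=timeout_s,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
            )
            completed += 1
        except subprocess.TimeoutExpired:
            hung += 1
    return completed, hung
```

For example, calling `count_hangs(["./app"], env_overrides={"HSA_ENABLE_INTERRUPT": "0"})` (where `./app` stands for the test binary) and comparing the result against a run with no overrides would show whether a setting changes the hang rate at all.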

Any help in fixing the bug is appreciated. Thank you.

ppanchad-amd commented 2 months ago

@Kelvin-Ng Apologies for the lack of response. Do you still need assistance with this ticket? If not, please close the ticket. Thanks!

ppanchad-amd commented 2 weeks ago

@Kelvin-Ng Closing ticket. Please feel free to re-open ticket if you still need assistance. Thanks!