ROCm / rocprofiler

ROC profiler library. Profiling with perf-counters and derived metrics.
https://rocm.docs.amd.com/projects/rocprofiler/en/latest/
MIT License

Intercept mode deadlocks with multiple threads driving separate GPUs #113

Closed. Opened by gcongiu, closed 1 month ago.

gcongiu commented 1 year ago

Running the intercept_multi_thread_monitoring test in PAPI (located in papi/src/components/rocm/tests/intercept_multi_thread_monitoring) deadlocks. The test runs a matrix-to-matrix multiplication kernel from multiple threads, each driving and monitoring its own dedicated GPU. The backtrace of the blocked thread, taken with rocgdb, follows:

#0  0x00007ffff599e54d in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007ffff5999eb6 in _L_lock_941 () from /lib64/libpthread.so.0
#2  0x000000010440c741 in ?? ()
#3  0x00007ffeeb9999d1 in rocprofiler::util::HsaRsrcFactory::GetKernelNameRef(unsigned long) () from /opt/rocm-5.5.0/lib/librocprofiler64.so
#4  0x00007ffeeb982f33 in rocprofiler::InterceptQueue::QueryKernelName(unsigned long, amd_kernel_code_s const*) ()
   from /opt/rocm-5.5.0/lib/librocprofiler64.so
#5  0x00007ffeeb9898e5 in rocprofiler::InterceptQueue::OnSubmitCB(void const*, unsigned long, unsigned long, void*, void (*)(void const*, unsigned long)) () from /opt/rocm-5.5.0/lib/librocprofiler64.so
#6  0x00007fffece1aafa in rocr::core::InterceptQueue::StoreRelaxed(long) () from /opt/rocm-5.5.0/lib/libhsa-runtime64.so.1
#7  0x00007fffece0d9a8 in rocr::HSA::hsa_signal_store_screlease(hsa_signal_s, long) () from /opt/rocm-5.5.0/lib/libhsa-runtime64.so.1
#8  0x00007ffff66850c4 in bool roc::VirtualGPU::dispatchGenericAqlPacket<hsa_kernel_dispatch_packet_s>(hsa_kernel_dispatch_packet_s*, unsigned short, unsigned short, bool, unsigned long) () from /opt/rocm-5.5.0/lib/libamdhip64.so.5
#9  0x00007ffff66828d7 in roc::VirtualGPU::submitKernelInternal(amd::NDRangeContainer const&, amd::Kernel const&, unsigned char const*, void*, unsigned int, amd::NDRangeKernelCommand*, hsa_kernel_dispatch_packet_s*) () from /opt/rocm-5.5.0/lib/libamdhip64.so.5
#10 0x00007ffff6683638 in roc::VirtualGPU::submitKernel(amd::NDRangeKernelCommand&) () from /opt/rocm-5.5.0/lib/libamdhip64.so.5
#11 0x00007ffff6654d1a in amd::Command::enqueue() () from /opt/rocm-5.5.0/lib/libamdhip64.so.5
#12 0x00007ffff657b723 in ihipModuleLaunchKernel(ihipModuleSymbol_t*, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, unsigned int, ihipStream_t*, void**, void**, ihipEvent_t*, ihipEvent_t*, unsigned int, unsigned int, unsigned int, unsigned int, unsigned long, unsigned long, unsigned int) () from /opt/rocm-5.5.0/lib/libamdhip64.so.5
#13 0x00007ffff65a27a7 in ihipLaunchKernel(void const*, dim3, dim3, void**, unsigned long, ihipStream_t*, ihipEvent_t*, ihipEvent_t*, int) ()
   from /opt/rocm-5.5.0/lib/libamdhip64.so.5
#14 0x00007ffff657b5a2 in hipLaunchKernel_common () from /opt/rocm-5.5.0/lib/libamdhip64.so.5
#15 0x00007ffff6589e12 in hipLaunchKernel () from /opt/rocm-5.5.0/lib/libamdhip64.so.5
#16 0x0000000000473597 in __device_stub__matmul(float*, float*, float*, int) ()
#17 0x0000000000473aa5 in hip_do_matmul_work (handle=0xa88530, stream=0xa1f610) at matmul.cpp:111
#18 0x00000000004730fc in .omp_outlined._debug__(int &, const char *(&)[4], int &, hipError_t &) const (.global_tid.=0x7fffffffc670,
    .bound_tid.=0x7fffffffc668, papi_errno=@0x7fffffffca6c: 0, events=..., pass_with_warning=@0x7fffffffca68: 0,
    hip_errno=@0x7fffffffca64: hipSuccess) at multi_thread_monitoring.cpp:103
#19 0x00000000004734cd in .omp_outlined.(void) const (.global_tid.=0x7fffffffc670, .bound_tid.=0x7fffffffc668, papi_errno=@0x7fffffffca6c: 0,
    events=..., pass_with_warning=@0x7fffffffca68: 0, hip_errno=@0x7fffffffca64: hipSuccess) at multi_thread_monitoring.cpp:63
#20 0x00007ffff7fcbf43 in __kmp_invoke_microtask () from /opt/rocm-5.5.0/llvm/bin/../lib/libomp.so
#21 0x00007ffff7f5177f in __kmp_invoke_task_func () from /opt/rocm-5.5.0/llvm/bin/../lib/libomp.so
#22 0x00007ffff7f4b85b in __kmp_fork_call () from /opt/rocm-5.5.0/llvm/bin/../lib/libomp.so
#23 0x00007ffff7f3cb05 in __kmpc_fork_call () from /opt/rocm-5.5.0/llvm/bin/../lib/libomp.so
#24 0x0000000000472da2 in multi_thread (argc=1, argv=0x7fffffffcb88) at multi_thread_monitoring.cpp:63
#25 0x00000000003eaf80 in main (argc=1, argv=0x7fffffffcb88) at intercept_multi_thread_monitoring.cpp:13

The version of ROCm used to reproduce this problem is 5.5.0 RC5. The test was run on two MI210s.
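
For readers without the test source at hand, the one-thread-per-GPU structure described above can be sketched roughly as follows. This is an illustration only, not the PAPI test code. `drive_device` and `run_one_thread_per_device` are hypothetical names, `std::thread` stands in for the OpenMP parallel region visible in the backtrace, and a small CPU matrix multiply stands in for the HIP kernel launch and the PAPI counter start/stop calls so that the sketch builds without a GPU.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for the per-device work. In the real test, this is
// roughly where a thread would select its GPU, start monitoring, launch the
// matmul kernel, and read the counters back. Here it is a plain CPU matmul.
float drive_device(int device_id, int n) {
    (void)device_id;  // assumption: a real version would bind to this GPU
    const std::size_t dim = static_cast<std::size_t>(n);
    std::vector<float> a(dim * dim, 1.0f), b(dim * dim, 2.0f), c(dim * dim, 0.0f);
    for (std::size_t i = 0; i < dim; ++i)
        for (std::size_t j = 0; j < dim; ++j)
            for (std::size_t k = 0; k < dim; ++k)
                c[i * dim + j] += a[i * dim + k] * b[k * dim + j];
    return c[0];  // one element is enough to check the work ran
}

// Spawn one worker per device; each worker writes only its own result slot,
// so the workers share no mutable state in this sketch.
std::vector<float> run_one_thread_per_device(int num_devices, int n) {
    assert(num_devices > 0);
    std::vector<float> results(static_cast<std::size_t>(num_devices), 0.0f);
    std::vector<std::thread> workers;
    for (int dev = 0; dev < num_devices; ++dev) {
        workers.emplace_back([&results, dev, n] {
            results[static_cast<std::size_t>(dev)] = drive_device(dev, n);
        });
    }
    for (auto& w : workers) w.join();  // the reported hang is the case where a worker never returns
    return results;
}
```

Calling `run_one_thread_per_device(2, 4)` mirrors the two-MI210 setup. In the failing runs, the equivalent of a worker blocks inside the kernel launch (frames #0 to #5 above) and the join never completes.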

gcongiu commented 1 year ago

I verified this with the rocm-5.5.0 stable release, and the problem is also present there.

gcongiu commented 1 year ago

@ammarwa any update on this?

gcongiu commented 1 year ago

Added reproducer issue-113.tar.gz

ppanchad-amd commented 3 months ago

@gcongiu Apologies for the lack of response. Can you please check if your issue still exists with the latest ROCm 6.2? If so, we will further investigate the issue. Thanks!

sohaibnd commented 1 month ago

Hi @gcongiu, I have not been able to reproduce this on ROCm 6.2 using 2 MI210s and the reproducer code you provided. Can you try upgrading to ROCm 6.2 and check if the issue persists?

sohaibnd commented 1 month ago

@gcongiu I'm going to close this issue due to inactivity. If the issue is still present, feel free to re-open the ticket and we can look into it further.