ROCm / roctracer

ROCm Tracer Callback/Activity Library for Performance tracing AMD GPUs
https://rocm.docs.amd.com/projects/roctracer/en/latest/

Rocm/Roctracer will hang and crash when interrupted by a real-time timer #22

Closed mxz297 closed 2 years ago

mxz297 commented 4 years ago

We encountered a hang when using Roctracer to collect real-time profiling data on both CPUs and GPUs. HPCToolkit collects real-time profiling data by repeatedly setting a real-time timer and registering a signal handler to record samples when the timer goes off. Our example application (the roctracer example) hangs non-deterministically in ioctl.
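As background, the pattern described above (a POSIX interval timer whose expirations are delivered as signals to a sampling handler) looks roughly like the sketch below. This is not HPCToolkit's or the attached reproducer's code; the names and the 2.5 ms interval are illustrative.

#include <signal.h>
#include <string.h>
#include <time.h>

/* Illustrative handler: a real profiler would record a call stack sample here. */
static void sample_handler(int sig)
{
    (void) sig;
}

static void start_realtime_sampling(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = sample_handler;
    sigaction(SIGALRM, &sa, NULL);

    struct sigevent sev;
    memset(&sev, 0, sizeof(sev));
    sev.sigev_notify = SIGEV_SIGNAL;
    sev.sigev_signo  = SIGALRM;

    timer_t timer_id;
    timer_create(CLOCK_REALTIME, &sev, &timer_id);   /* wall-clock ("real-time") timer */

    struct itimerspec its;
    memset(&its, 0, sizeof(its));
    its.it_interval.tv_nsec = 2500000;               /* 2.5 ms period */
    its.it_value = its.it_interval;
    timer_settime(timer_id, 0, &its, NULL);
}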

I created a reproducer based on the roctracer example, which contains only the real-time timer logic without any other HPCToolkit logic.

If we compile this reproducer and run it, it will hang non-deterministically at the following stack trace:

#0  0x00007f7ad29bd5d7 in ioctl () at ../sysdeps/unix/syscall-template.S:78
#1  0x00007f7acd0f3f28 in kmtIoctl () from /opt/rocm/lib/libhsakmt.so.1
#2  0x00007f7acd0ee36f in hsaKmtWaitOnMultipleEvents () from /opt/rocm/lib/libhsakmt.so.1
#3  0x00007f7acd0ee929 in hsaKmtWaitOnEvent () from /opt/rocm/lib/libhsakmt.so.1
#4  0x00007f7ad38d694a in core::InterruptSignal::WaitRelaxed(hsa_signal_condition_t, long, unsigned long, hsa_wait_state_t) () from /opt/rocm/hsa/lib/libhsa-runtime64.so.1
#5  0x00007f7ad38d674a in core::InterruptSignal::WaitAcquire(hsa_signal_condition_t, long, unsigned long, hsa_wait_state_t) () from /opt/rocm/hsa/lib/libhsa-runtime64.so.1
#6  0x00007f7ad38c8ad9 in HSA::hsa_signal_wait_scacquire(hsa_signal_s, hsa_signal_condition_t, long, unsigned long, hsa_wait_state_t) () from /opt/rocm/hsa/lib/libhsa-runtime64.so.1
#7  0x00007f7abf3c3b31 in waitComplete () at /data/jenkins_workspace/compute-rocm-rel-2.9/external/hcc-tot/lib/hsa/mcwamp_hsa.cpp:4930
#8  0x00007f7abf3c37cd in operator() () at /data/jenkins_workspace/compute-rocm-rel-2.9/external/hcc-tot/lib/hsa/mcwamp_hsa.cpp:5070
#9  _M_invoke<> () at /usr/lib/gcc/x86_64-linux-gnu/5.4.0/../../../../include/c++/5.4.0/functional:1530
#10 operator() () at /usr/lib/gcc/x86_64-linux-gnu/5.4.0/../../../../include/c++/5.4.0/functional:1520
#11 operator() () at /usr/lib/gcc/x86_64-linux-gnu/5.4.0/../../../../include/c++/5.4.0/future:1342
#12 0x00007f7abf3c3762 in _M_invoke () at /usr/lib/gcc/x86_64-linux-gnu/5.4.0/../../../../include/c++/5.4.0/functional:1856
#13 0x00007f7abf3c36d7 in operator() () at /usr/lib/gcc/x86_64-linux-gnu/5.4.0/../../../../include/c++/5.4.0/functional:2267
#14 _M_do_set () at /usr/lib/gcc/x86_64-linux-gnu/5.4.0/../../../../include/c++/5.4.0/future:527
#15 0x00007f7ad47f8827 in __pthread_once_slow (once_control=0x7f7a9d0eb1d8, init_routine=0x7f7ad2f6c830 <__once_proxy>) at pthread_once.c:116
#16 0x00007f7abf3c436b in __gthread_once () at /usr/lib/gcc/x86_64-linux-gnu/5.4.0/../../../../include/x86_64-linux-gnu/c++/5.4.0/bits/gthr-default.h:699
#17 call_once<void (std::__future_base::_State_baseV2::*)(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()> *, bool *), std::__future_base::_State_baseV2 *, std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()> *, bool *> () at /usr/lib/gcc/x86_64-linux-gnu/5.4.0/../../../../include/c++/5.4.0/mutex:738
#18 _M_set_result () at /usr/lib/gcc/x86_64-linux-gnu/5.4.0/../../../../include/c++/5.4.0/future:386
#19 _M_complete_async () at /usr/lib/gcc/x86_64-linux-gnu/5.4.0/../../../../include/c++/5.4.0/future:1606
#20 0x00007f7abf3a87ce in wait () at /usr/lib/gcc/x86_64-linux-gnu/5.4.0/../../../../include/c++/5.4.0/future:319
#21 wait () at /usr/lib/gcc/x86_64-linux-gnu/5.4.0/../../../../include/c++/5.4.0/future:656
#22 wait () at /data/jenkins_workspace/compute-rocm-rel-2.9/external/hcc-tot/lib/hsa/mcwamp_hsa.cpp:1628
#23 0x00007f7abf3a7ef5 in copy_ext () at /data/jenkins_workspace/compute-rocm-rel-2.9/external/hcc-tot/lib/hsa/mcwamp_hsa.cpp:4144
#24 0x00007f7ad3b7fa15 in ihipStream_t::locked_copySync(void*, void const*, unsigned long, unsigned int, bool) () from /opt/rocm/hip/lib/libhip_hcc.so
#25 0x00007f7ad3bb9f0e in hipMemcpy () from /opt/rocm/hip/lib/libhip_hcc.so
#26 0x000000000040622d in main () at MatrixTranspose.cpp:183

If we change the timer interval from 2.5 milliseconds to 1 millisecond, the reproducer deterministically crashes with the following stack trace:

#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007fc0de253801 in __GI_abort () at abort.c:79
#2  0x00007fc0defd8844 in ?? () from ../../build/libroctracer64.so.1
#3  0x00007fc0df51a367 in api_callbacks_spawner_t<3>::api_callbacks_spawner_t(hip_api_id_t const&, hip_api_data_t&) () from /opt/rocm/hip/lib/libhip_hcc.so
#4  0x00007fc0df519e4f in hipMalloc () from /opt/rocm/hip/lib/libhip_hcc.so
#5  0x00000000004060cf in main () at MatrixTranspose.cpp:154

These stack traces include code from roctracer as well as code from other AMD components such as HIP. I cannot determine whether the root cause of the issue is in HIP or roctracer, but since HPCToolkit is a direct user of roctracer, I am reporting it here.

Reproducer.zip

mxz297 commented 4 years ago

Replacing the real-time timer with a CPU-time timer still hangs. This suggests that the handling of EINTR for system calls is not the issue.

eshcherb commented 4 years ago

Could you upload the test with the CPU-time timer?

mxz297 commented 4 years ago

You can replace the real-time timer with a CPU-time timer by changing line 65 of "MatrixTranspose.cpp".

There is a comment block before line 65 describing how to do it:

    /*  A REALTIME timer can interrupt syscalls, but a CPUTIME timer will not.
     *  Replacing CLOCK_REALTIME with CLOCK_PROCESS_CPUTIME_ID to measure
     *  CPU time still hangs.
     */
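Concretely, assuming the reproducer creates the timer with timer_create() (the comment's mention of CLOCK_REALTIME suggests it does), the swap amounts to changing the clock argument; sev and timer_id are placeholder names:

/* original: wall-clock timer; its signal can interrupt blocking syscalls */
timer_create(CLOCK_REALTIME, &sev, &timer_id);

/* swapped: per-process CPU-time timer, as described in the comment above */
timer_create(CLOCK_PROCESS_CPUTIME_ID, &sev, &timer_id);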

mxz297 commented 4 years ago

@eshcherb We found that the hang has nothing to do with roctracer. In the example attached below, the code does not link against roctracer and still hangs in the same way. This means the bug is in the ROCm runtime. Where should I report this bug? Or can you forward my reproducer to the responsible developers?

rocm-hang.tar.gz

eshcherb commented 4 years ago

Thank you for the update, much appreciated! I also tried that, and I have already created an internal AMD ticket for KFD. I will keep you updated on the progress.

eshcherb commented 4 years ago

While we are looking into the issue, could you try a workaround: $ export HSA_ENABLE_INTERRUPT=0

The setting will force ROCr-runtime/KFD to use polling to wait for signals rather than interrupts.
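For example, assuming the reproducer binary is named MatrixTranspose, the workaround can be applied to a single run without exporting it globally:

$ HSA_ENABLE_INTERRUPT=0 ./MatrixTranspose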

ROCmSupport commented 2 years ago

Hi @mxz297, can you please validate with the latest ROCm 4.5 code and update the status? Thank you.

ROCmSupport commented 2 years ago

Hi @mxz297, I am not able to reproduce this with the latest ROCm 4.5. Please check with the latest ROCm 4.5 code. Thank you.

mxz297 commented 2 years ago

The reproducer no longer compiles with a recent ROCm. I tweaked it to compile with rocm-4.3.1, and it does not seem to have the problem anymore.

jrmadsen commented 2 years ago

@ROCmSupport I was able to reproduce this issue w/ ROCm 4.5.0

jmellorcrummey commented 2 years ago

@ROCmSupport I observed this issue with ROCm 4.5.2 using hpctoolkit's hpcrun to sample miniqmc. The application deadlocked with a thread stuck in a callstack like that above ending in ioctl (I didn't save the callstack). @jrmadsen Do you have a good reproducer or should we try to build one? I think that any thread that calls HSA::hsa_signal_wait_scacquire is vulnerable.

jrmadsen commented 2 years ago

@jmellorcrummey I haven't created a reproducer yet. For the time being, I've essentially resorted to just setting HSA_ENABLE_INTERRUPT=0 as the default when the library gets loaded:

#include <cstdlib>  // setenv

namespace
{
// Set HSA_ENABLE_INTERRUPT=0 at load time; the final argument 0 means
// an existing value in the environment is not overwritten.
int disable_hsa_interrupt_on_load = setenv("HSA_ENABLE_INTERRUPT", "0", 0);
}

It is not ideal and definitely causes the CPU utilization to increase but at least it prevents the deadlock. I did delve a little deeper into it.

Testing

I noticed a possible issue with the implementation in ROCT-Thunk-Interface where errno wasn't being saved in kmtIoctl (which is called from HSA::hsa_signal_wait_scacquire):

int kmtIoctl(int fd, unsigned long request, void *arg)
{
    int ret;

    do {
        ret = ioctl(fd, request, arg);
    } while (ret == -1 && (errno == EINTR || errno == EAGAIN));

    if (ret == -1 && errno == EBADF) {
        /* In case pthread_atfork didn't catch it, this will
         * make any subsequent hsaKmt calls fail in CHECK_KFD_OPEN.
         */
        pr_err("KFD file descriptor not valid in this process\n");
        is_forked_child();
    }

    return ret;
}

and thought maybe this was happening because the signal handler was overwriting errno, but correcting that and installing a custom ROCR-Runtime built against it did not fix the problem.

I think that any thread that calls HSA::hsa_signal_wait_scacquire is vulnerable.

It appears it is the ioctl call itself. Interrupting HSA::hsa_signal_wait_scacquire elsewhere appears fine (maybe that's what you meant).

Once I re-implemented kmtIoctl to block signals around the ioctl call, removed my env setting, and cranked up SIGALRM and SIGPROF to 200 interrupts per second (which virtually guaranteed a deadlock before this change), the deadlocks disappeared:

#include <errno.h>
#include <pthread.h>
#include <signal.h>
#include <sys/ioctl.h>

static __thread sigset_t _signal_set;

static void setup_signal_set(void)
{
    static __thread size_t _once = 0;
    if(_once != 0) return;
    _once = 1;

    sigemptyset(&_signal_set);
    sigaddset(&_signal_set, SIGPROF);
    sigaddset(&_signal_set, SIGALRM);
    sigaddset(&_signal_set, SIGVTALRM);
}

/* Call ioctl, restarting if it is interrupted */
int kmtIoctl(int fd, unsigned long request, void *arg)
{
    int ret = 0;
    int err = 0;

    setup_signal_set();
    pthread_sigmask(SIG_BLOCK, &_signal_set, NULL);

    do
    {
        ret = ioctl(fd, request, arg);
        err = errno;
    }
    while(ret == -1 && (err == EINTR || err == EAGAIN));

    if (ret == -1 && err == EBADF) {
        /* In case pthread_atfork didn't catch it, this will
         * make any subsequent hsaKmt calls fail in CHECK_KFD_OPEN.
         */
        pr_err("KFD file descriptor not valid in this process\n");
        is_forked_child();
    }

    pthread_sigmask(SIG_UNBLOCK, &_signal_set, NULL);

    return ret;
}

Thus, as far as I can tell, it appears that when the signal is delivered during ret = ioctl(fd, request, arg), ioctl never returns (e.g. in another experiment, I put a print statement right after this line and never saw it print when the deadlock happened).

EDIT: as a sanity check, I removed the signal blocking and all my tests resumed deadlocking.

jrmadsen commented 2 years ago

As you can see from this snippet from one of the outputs, there were several potential cases for the signal handler to interrupt the ioctl call:

|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|                                                                                                             WALL CLOCK TIME (VIA SAMPLING)                                                                                                            |
|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|                                                          LABEL                                                           | COUNT  | DEPTH  |       METRIC        | UNITS  |   SUM     |   MEAN   |   MIN    |   MAX    |   VAR    | STDDEV   | % SELF |
|--------------------------------------------------------------------------------------------------------------------------|--------|--------|---------------------|--------|-----------|----------|----------|----------|----------|----------|--------|
...
| |0>>> |_hipLaunchKernel                                                                                                  |   2337 |      1 | sampling_wall_clock | sec    | 14.395847 | 0.006160 | 0.001834 | 0.207173 | 0.000041 | 0.006394 |    0.0 |
| |0>>>   |_hip_impl::hipLaunchKernelGGLImpl(unsigned long, dim3 const&, dim3 const&, unsigned int, ihipStream_t*, void**) |   2282 |      2 | sampling_wall_clock | sec    | 13.963639 | 0.006119 | 0.001836 | 0.207173 | 0.000040 | 0.006350 |    0.0 |
| |0>>>     |_hipModuleGetTexRef                                                                                           |   2282 |      3 | sampling_wall_clock | sec    | 13.963639 | 0.006119 | 0.001836 | 0.207173 | 0.000040 | 0.006350 |    0.0 |
| |0>>>       |_hipTexObjectCreate                                                                                         |   2258 |      4 | sampling_wall_clock | sec    | 13.788658 | 0.006107 | 0.001836 | 0.207173 | 0.000041 | 0.006377 |    0.0 |
| |0>>>         |_hipTexObjectCreate                                                                                       |   2255 |      5 | sampling_wall_clock | sec    | 13.774659 | 0.006108 | 0.001836 | 0.207173 | 0.000041 | 0.006381 |    0.1 |
| |0>>>           |___new_sem_wait_slow.constprop.0                                                                        |   2016 |      6 | sampling_wall_clock | sec    | 10.615110 | 0.005265 | 0.001845 | 0.207173 | 0.000023 | 0.004846 |    0.0 |
| |0>>>             |_do_futex_wait.constprop.0                                                                            |   2015 |      7 | sampling_wall_clock | sec    | 10.610925 | 0.005266 | 0.001845 | 0.207173 | 0.000023 | 0.004847 |  100.0 |
| |0>>>           |_hipTexObjectCreate                                                                                     |    236 |      6 | sampling_wall_clock | sec    |  3.147407 | 0.013336 | 0.001836 | 0.090864 | 0.000131 | 0.011427 |    0.4 |
| |0>>>             |_hipTexObjectCreate                                                                                   |    233 |      7 | sampling_wall_clock | sec    |  3.134396 | 0.013452 | 0.001836 | 0.090864 | 0.000131 | 0.011454 |    0.1 |
| |0>>>               |_hipTexObjectCreate                                                                                 |    231 |      8 | sampling_wall_clock | sec    |  3.125229 | 0.013529 | 0.001836 | 0.090864 | 0.000132 | 0.011473 |    0.0 |
| |0>>>                 |_hipTexObjectCreate                                                                               |    229 |      9 | sampling_wall_clock | sec    |  3.085253 | 0.013473 | 0.001836 | 0.090864 | 0.000130 | 0.011421 |    0.0 |
| |0>>>                   |_hipTexObjectCreate                                                                             |    229 |     10 | sampling_wall_clock | sec    |  3.085253 | 0.013473 | 0.001836 | 0.090864 | 0.000130 | 0.011421 |    0.0 |
| |0>>>                     |_rocr::HSA::hsa_signal_wait_scacquire(hsa_signal_s, hsa_signal_condition_t, long, unsigned... |    227 |     11 | sampling_wall_clock | sec    |  3.070101 | 0.013525 | 0.001836 | 0.090864 | 0.000131 | 0.011451 |    0.0 |
| |0>>>                       |_rocr::core::InterruptSignal::WaitAcquire(hsa_signal_condition_t, long, unsigned long, h... |    227 |     12 | sampling_wall_clock | sec    |  3.070101 | 0.013525 | 0.001836 | 0.090864 | 0.000131 | 0.011451 |    0.0 |
| |0>>>                         |_rocr::core::InterruptSignal::WaitRelaxed(hsa_signal_condition_t, long, unsigned long,... |    227 |     13 | sampling_wall_clock | sec    |  3.070101 | 0.013525 | 0.001836 | 0.090864 | 0.000131 | 0.011451 |   98.9 |
| |0>>>                           |_hsaKmtWaitOnEvent                                                                      |      3 |     14 | sampling_wall_clock | sec    |  0.033996 | 0.011332 | 0.010028 | 0.011991 | 0.000001 | 0.001130 |    0.0 |
| |0>>>                             |_hsaKmtWaitOnMultipleEvents                                                           |      3 |     15 | sampling_wall_clock | sec    |  0.033996 | 0.011332 | 0.010028 | 0.011991 | 0.000001 | 0.001130 |    0.0 |
| |0>>>                               |_kmtIoctl                                                                           |      3 |     16 | sampling_wall_clock | sec    |  0.033996 | 0.011332 | 0.010028 | 0.011991 | 0.000001 | 0.001130 |    0.0 |
| |0>>>                                 |_pthread_sigmask                                                                  |      3 |     17 | sampling_wall_clock | sec    |  0.033996 | 0.011332 | 0.010028 | 0.011991 | 0.000001 | 0.001130 |  100.0 |

jmellorcrummey commented 2 years ago

@jrmadsen We agree with your diagnosis. I meant that any thread calling hsa_signal_wait_scacquire is vulnerable to a deadlock. We also believe that the deadlocks arise from a signal interrupting the ioctl that it calls. We are aware of the flag to block interrupts; we have been using it for 2.5 years when profiling HIP programs with roctracer. I ran into the problem while profiling programs that use OpenMP offloading through HSA, which didn't go through the path that sets the environment variable to disable HSA interrupts.

We’re glad that you are looking into this.

jmellorcrummey commented 2 years ago

Why is this issue marked as closed? ROCm still hangs when interrupted by a Linux REALTIME timer.

skyreflectedinmirrors commented 2 years ago

@jmellorcrummey -- just as an update, I have been trying to reproduce this with both of the examples here, but haven't had much luck. I did find another issue in initializing roctracer, which I've reported internally:

void start_tracing();

int main() {
  start_tracing();
}

#include <roctracer/roctracer_hip.h>

void api_callback(uint32_t domain, uint32_t cid, const void *callback_data,
                  void *arg) {
  const hip_api_data_t *data =
      reinterpret_cast<const hip_api_data_t *>(callback_data);
}

void start_tracing() { roctracer_enable_callback(api_callback, NULL); }

Still working on it.

jmellorcrummey commented 2 years ago

The way I saw this recently was using HPCToolkit's hpcrun to profile miniqmc. Perhaps you can reproduce it with Linux perf sampling with realtime.

Otherwise, we could write a little preloaded library that does nothing but set up realtime sampling on the main thread and wrap pthread_create to start sampling for every other thread (see the sketch below), then preload the library and launch miniqmc.
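Something along these lines could serve as that preload library. This is a rough sketch, not HPCToolkit code; the file name, interval, and build line are made up, and the SIGEV_THREAD_ID plumbing is the Linux-specific way to direct each timer's signal at its own thread.

/* sampler_preload.c -- sketch of the preloaded sampling library described above.
 * Build (assumed): gcc -shared -fPIC sampler_preload.c -o libsampler.so -ldl -lrt -lpthread
 * Run:             LD_PRELOAD=./libsampler.so ./miniqmc ...
 */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <pthread.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

#ifndef sigev_notify_thread_id          /* older glibc does not expose this name */
#define sigev_notify_thread_id _sigev_un._tid
#endif

static void sample_handler(int sig) { (void) sig; /* a profiler would record a sample */ }

/* Arm a 2 ms CLOCK_REALTIME timer whose SIGALRM is delivered to the calling thread. */
static void start_sampling(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = sample_handler;
    sigaction(SIGALRM, &sa, NULL);

    struct sigevent sev;
    memset(&sev, 0, sizeof(sev));
    sev.sigev_notify = SIGEV_THREAD_ID;
    sev.sigev_signo = SIGALRM;
    sev.sigev_notify_thread_id = syscall(SYS_gettid);

    timer_t t;
    timer_create(CLOCK_REALTIME, &sev, &t);

    struct itimerspec its;
    memset(&its, 0, sizeof(its));
    its.it_interval.tv_nsec = 2000000;  /* 2 ms period */
    its.it_value = its.it_interval;
    timer_settime(t, 0, &its, NULL);
}

/* Start sampling on the main thread as soon as the library is loaded. */
__attribute__((constructor)) static void on_load(void) { start_sampling(); }

/* Wrap pthread_create so each new thread arms its own timer before running. */
struct wrapped { void *(*fn)(void *); void *arg; };

static void *trampoline(void *p)
{
    struct wrapped w = *(struct wrapped *) p;
    free(p);
    start_sampling();
    return w.fn(w.arg);
}

typedef int (*pthread_create_fn)(pthread_t *, const pthread_attr_t *,
                                 void *(*)(void *), void *);

int pthread_create(pthread_t *thread, const pthread_attr_t *attr,
                   void *(*fn)(void *), void *arg)
{
    static pthread_create_fn real_create;
    if (!real_create)
        real_create = (pthread_create_fn) dlsym(RTLD_NEXT, "pthread_create");

    struct wrapped *w = malloc(sizeof(*w));
    w->fn = fn;
    w->arg = arg;
    return real_create(thread, attr, trampoline, w);
}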

skyreflectedinmirrors commented 2 years ago

perhaps you can reproduce it with linux perf sampling with realtime.

That's a good idea -- I had been trying to build Omnitrace and use that, but ran into fun build issues on our cluster. I'll try that next. Have you been triggering this on Crusher? I doubt it depends on the system, but thought I'd check.

jmellorcrummey commented 2 years ago

I haven't tried this one on Crusher. I was seeing it on our local system with an MI50 and ROCm 5.1. @jrmadsen saw this problem on AMD systems, though I don't know which ROCm version and GPUs. I think he was using Omnitrace.

jrmadsen commented 2 years ago

perhaps you can reproduce it with linux perf sampling with realtime.

That's a good idea -- I had been trying to build Omnitrace and use that, but ran into fun build issues on our cluster. I'll try that next. Have you been triggering this on Crusher? I doubt it depends on the system, but thought I'd check.

@arghdos You shouldn't have to build it. You should just be able to use this installer and then:

export HSA_ENABLE_INTERRUPT=1
export OMNITRACE_ENABLE_SAMPLING=ON
export OMNITRACE_SAMPLING_FREQ=500

If you haven't been seeing it with Omnitrace, it's because I call setenv("HSA_ENABLE_INTERRUPT", "0", 0) when the library loads, but I don't override an existing value because there's nothing wrong with it being 1 when solely using the binary instrumentation.

skyreflectedinmirrors commented 2 years ago

I didn't look closely enough to see that you don't override it if it's already set in the env :)