ROCm / rocprofiler

ROC profiler library. Profiling with perf-counters and derived metrics.
https://rocm.docs.amd.com/projects/rocprofiler/en/latest/
MIT License

Memory access fault by GPU node-i #80

Open · gcongiu opened this issue 2 years ago

gcongiu commented 2 years ago

HIP application crashes with error:

Memory access fault by GPU node-3 (Agent handle: 0x3c7610) on address 0x155555368000. Reason: Page not present or supervisor privilege.

whenever events from different devices are monitored at the same time. The problem can be reproduced with this simple rocprofiler program: rocprofiler-test.tar.gz

The problem was observed with rocm-4.5.0 and rocm-5.0.0 on MI100 GPUs.
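
For illustration, here is a minimal sketch of the failing pattern (a reconstruction assuming the rocprofiler v1 standalone API, not the code from the attached tarball; counter name is illustrative and error handling is omitted):

#include <rocprofiler.h>
#include <vector>

// One standalone-mode context per agent, all sharing one
// rocprofiler_properties_t (and hence one profiling queue), with
// counters on every device active at the same time.
void sample_all_devices(const std::vector<hsa_agent_t>& gpu_agents) {
  rocprofiler_feature_t feature{};
  feature.kind = ROCPROFILER_FEATURE_KIND_METRIC;
  feature.name = "SQ_WAVES";                     // illustrative counter

  rocprofiler_properties_t properties{};         // shared across all agents
  std::vector<rocprofiler_t*> contexts(gpu_agents.size());
  for (size_t i = 0; i < gpu_agents.size(); ++i)
    rocprofiler_open(gpu_agents[i], &feature, 1, &contexts[i],
                     ROCPROFILER_MODE_STANDALONE, &properties);

  for (auto* ctx : contexts) rocprofiler_start(ctx, 0);  // concurrent monitoring
  // ... launch HIP kernels on the devices ...
  for (auto* ctx : contexts) {
    rocprofiler_read(ctx, 0);                    // read/stop sequence per device
    rocprofiler_get_data(ctx, 0);
    rocprofiler_get_metrics(ctx);
    rocprofiler_stop(ctx, 0);
    rocprofiler_close(ctx);
  }
}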

kikimych commented 2 years ago

Same issue as https://github.com/ROCm-Developer-Tools/rocprofiler/issues/66.

gcongiu commented 2 years ago

@kikimych how is issue #66 related to this one? It looks like that issue is related to intercept mode, while this issue is related to sample mode.
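
For readers outside the thread, a rough sketch of that distinction (assuming the rocprofiler v1 API; the callback body is a placeholder): intercept mode hooks the application's own kernel dispatches, while sample mode opens a standalone context on an agent and reads counters on the tool's schedule.

#include <rocprofiler.h>

static hsa_status_t on_dispatch(const rocprofiler_callback_data_t*, void*,
                                rocprofiler_group_t*) {
  return HSA_STATUS_SUCCESS;                 // placeholder per-dispatch hook
}

void two_modes(hsa_agent_t agent, rocprofiler_feature_t* feature) {
  // Intercept mode: profile kernels as the application dispatches them.
  rocprofiler_queue_callbacks_t callbacks{};
  callbacks.dispatch = on_dispatch;
  rocprofiler_set_queue_callbacks(callbacks, nullptr);

  // Sample mode: standalone context, read independently of dispatches.
  rocprofiler_t* ctx = nullptr;
  rocprofiler_properties_t props{};
  rocprofiler_open(agent, feature, 1, &ctx,
                   ROCPROFILER_MODE_STANDALONE, &props);
  rocprofiler_start(ctx, 0);
  rocprofiler_read(ctx, 0);
  rocprofiler_stop(ctx, 0);
  rocprofiler_close(ctx);
}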

kikimych commented 2 years ago

Same underlying problem as the infinite recursion; both issues would be fixed by the same commit.

gcongiu commented 2 years ago

A PAPI user recently reported an issue that seems to be related to this one: https://bitbucket.org/icl/papi/issues/110/rocm-component-papi_stop-memory-access

jrodgers-github commented 2 years ago

Capturing notes/findings from https://bitbucket.org/icl/papi/issues/110/rocm-component-papi_stop-memory-access:

See the complementary PAPI ticket for the full list of metrics that encounter the error.
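
For context, a sketch of how the linked PAPI report exercises the same pattern: one event set with counters from two different devices. Event names follow the PAPI rocm component's "rocm:::<counter>:device=<n>" convention; the specific counter is illustrative and error handling is omitted.

#include <papi.h>

void measure_two_devices() {
  int es = PAPI_NULL;
  long long values[2] = {0, 0};
  PAPI_library_init(PAPI_VER_CURRENT);
  PAPI_create_eventset(&es);
  PAPI_add_named_event(es, "rocm:::SQ_WAVES:device=0");
  PAPI_add_named_event(es, "rocm:::SQ_WAVES:device=1");  // second GPU
  PAPI_start(es);
  // ... launch HIP kernels on both devices ...
  PAPI_stop(es, values);  // the linked report sees the fault around PAPI_stop
}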

marklawsAMD commented 1 year ago

This issue is fixed in the staging branch of rocprof and will be part of ROCm 5.7. If you'd like to test the fix, let me know here and I'll attach a patch that will apply cleanly to the current github version; otherwise, please try when ROCm 5.7 is released. Thanks!

gcongiu commented 1 year ago

Hi @marklawsAMD, that sounds good. Please send the patch over and I will test it with PAPI ASAP.

marklawsAMD commented 1 year ago

> Hi @marklawsAMD, that sounds good. Please send the patch over and I will test it with PAPI ASAP.

github-rocprofiler-80.patch

If that doesn't solve it, please let me know; this is one of a number of issues for which I'm trying to get in fixes before the branch.

bertwesarg commented 1 year ago

I can still reproduce this issue (https://github.com/gcongiu/rocm-issues/tree/main/issue-80) with ROCm 5.7.0 build 19 on a dual MI210 node. I will now update to 5.7.0 build 36 and test again.

Memory access fault by GPU node-8 (Agent handle: 0x33c2f0) on address 0x7fffeb572000. Reason: Unknown.

Thread 2 "issue-80" received signal SIGABRT, Aborted.
[Switching to Thread 0x7fffebd84700 (LWP 1858869)]
0x00007ffff5143a9f in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: yum debuginfo-install comgr-2.5.0.50700-19.el8.x86_64 elfutils-libelf-0.186-1.el8.x86_64 glibc-2.28-189.5.el8_6.x86_64 hip-runtime-amd-5.7.31921.50700-19.el8.x86_64 hsa-rocr-1.11.0.50700-19.el8.x86_64 libzstd-1.4.4-1.el8.x86_64 ncurses-libs-6.1-9.20180224.el8.x86_64 numactl-libs-2.0.12-13.el8.x86_64 rocprofiler-2.0.0.50700-19.el8.x86_64
(gdb) bt
#0  0x00007ffff5143a9f in raise () from /lib64/libc.so.6
#1  0x00007ffff5116e05 in abort () from /lib64/libc.so.6
#2  0x00007ffff794b30a in rocr::core::Runtime::VMFaultHandler(long, void*) [clone .cold.743] () from /opt/rocm-5.7.0/lib/libhsa-runtime64.so.1
#3  0x00007ffff79923c4 in rocr::core::Runtime::AsyncEventsLoop(void*) () from /opt/rocm-5.7.0/lib/libhsa-runtime64.so.1
#4  0x00007ffff794ef07 in rocr::os::ThreadTrampoline(void*) () from /opt/rocm-5.7.0/lib/libhsa-runtime64.so.1
#5  0x00007ffff54c21cf in start_thread () from /lib64/libpthread.so.0
#6  0x00007ffff512edd3 in clone () from /lib64/libc.so.6
(gdb) info threads
  Id   Target Id                                      Frame 
  1    Thread 0x7ffff7e94280 (LWP 1858865) "issue-80" 0x00007ffff512e91d in syscall () from /lib64/libc.so.6
* 2    Thread 0x7fffebd84700 (LWP 1858869) "issue-80" 0x00007ffff5143a9f in raise () from /lib64/libc.so.6
(gdb) thread 1
[Switching to thread 1 (Thread 0x7ffff7e94280 (LWP 1858865))]
#0  0x00007ffff512e91d in syscall () from /lib64/libc.so.6
(gdb) bt
#0  0x00007ffff512e91d in syscall () from /lib64/libc.so.6
#1  0x00007ffff7f994bd in KMPNativeAffinity::Mask::set_system_affinity(bool) const () from /opt/rocm-5.7.0/llvm/lib/libomp.so
#2  0x00007ffff7fb1051 in __kmp_affinity_bind_thread () from /opt/rocm-5.7.0/llvm/lib/libomp.so
#3  0x00007ffff7f95592 in __kmp_affinity_create_x2apicid_map(kmp_i18n_id*) () from /opt/rocm-5.7.0/llvm/lib/libomp.so
#4  0x00007ffff7f90d9c in __kmp_aux_affinity_initialize(kmp_affinity_t&) () from /opt/rocm-5.7.0/llvm/lib/libomp.so
#5  0x00007ffff7f58bee in __kmp_do_middle_initialize() () from /opt/rocm-5.7.0/llvm/lib/libomp.so
#6  0x00007ffff7f4fb78 in __kmp_parallel_initialize () from /opt/rocm-5.7.0/llvm/lib/libomp.so
#7  0x00007ffff7f51c7e in __kmp_fork_call () from /opt/rocm-5.7.0/llvm/lib/libomp.so
#8  0x00007ffff7f442a5 in __kmpc_fork_call () from /opt/rocm-5.7.0/llvm/lib/libomp.so
#9  0x000000000020b6d5 in main () at sampling.cpp:164

bertwesarg commented 1 year ago

I got ROCm 5.7.0 build 48, but the error remains:

Memory access fault by GPU node-8 (Agent handle: 0x1db02f0) on address 0x7fa08547c000. Reason: Unknown.
Aborted (core dumped)

vlaindic commented 1 year ago

@bertwesarg @gcongiu Thank you once again for reporting this.

After investigating in cooperation with @marklawsAMD, we noticed that the reproducer instantiates a single AQLProfile queue shared among multiple devices, which leads to the described error. After adapting the reproducer to always create an AQLProfile queue per GPU agent (by passing the ROCPROFILER_MODE_CREATEQUEUE mask), the main function finishes properly.
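
A minimal sketch of that adaptation (assuming the rocprofiler v1 standalone API; queue depth is illustrative and error handling is omitted): opening each context with the ROCPROFILER_MODE_CREATEQUEUE mask makes the library create a dedicated AQLProfile queue per GPU agent instead of sharing one.

#include <rocprofiler.h>

// gpu_agent and feature as in the reproducer sketch earlier in the thread.
rocprofiler_t* open_with_own_queue(hsa_agent_t gpu_agent,
                                   rocprofiler_feature_t* feature) {
  rocprofiler_properties_t properties{};
  properties.queue_depth = 128;  // illustrative depth for the per-agent queue
  rocprofiler_t* ctx = nullptr;
  rocprofiler_open(gpu_agent, feature, 1, &ctx,
                   ROCPROFILER_MODE_STANDALONE | ROCPROFILER_MODE_CREATEQUEUE,
                   &properties);  // CREATEQUEUE: dedicated AQLProfile queue
  return ctx;
}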

However, the reproducer sometimes still crashes at the very end of the program (during process shutdown).

Thread 2 "issue-80" hit Breakpoint 1, rocr::core::Runtime::VMFaultHandler (val=0, arg=0x314f60) at /work/git/ROCR-Runtime/src/core/runtime/runtime.cpp:1348
1348        assert(false && "GPU memory access fault.");
(gdb) info thread
  Id   Target Id                                    Frame
  1    Thread 0x7fffed01eac0 (LWP 17490) "issue-80" __futex_abstimed_wait_common64 (private=128, cancel=true, abstime=0x0,
    op=265, expected=17493, futex_word=0x7ffde75fd990) at ./nptl/futex-internal.c:57
* 2    Thread 0x7fffeceb7640 (LWP 17491) "issue-80" rocr::core::Runtime::VMFaultHandler (val=0, arg=0x314f60)
    at /work/git/ROCR-Runtime/src/core/runtime/runtime.cpp:1348
  4    Thread 0x7ffde75fd6c0 (LWP 17493) "issue-80" malloc_consolidate (av=av@entry=0x7ffff5cfcc80 <main_arena>)
    at ./malloc/malloc.c:4754

(gdb) thread 1
[Switching to thread 1 (Thread 0x7fffed01eac0 (LWP 17490))]
#0  __futex_abstimed_wait_common64 (private=128, cancel=true, abstime=0x0, op=265, expected=17493, futex_word=0x7ffde75fd990)
    at ./nptl/futex-internal.c:57
57      ./nptl/futex-internal.c: No such file or directory.
(gdb) where
#0  __futex_abstimed_wait_common64 (private=128, cancel=true, abstime=0x0, op=265, expected=17493, futex_word=0x7ffde75fd990)
    at ./nptl/futex-internal.c:57
#1  __futex_abstimed_wait_common (cancel=true, private=128, abstime=0x0, clockid=0, expected=17493, futex_word=0x7ffde75fd990)
    at ./nptl/futex-internal.c:87
#2  __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x7ffde75fd990, expected=17493,
    clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=private@entry=128) at ./nptl/futex-internal.c:139
#3  0x00007ffff5b796a4 in __pthread_clockjoin_ex (threadid=140728485271232, thread_return=0x7fffffffdfc0, clockid=0,
    abstime=0x0, block=<optimized out>) at ./nptl/pthread_join_common.c:105
#4  0x00007ffff5de920d in __kmp_reap_worker () from /opt/rocm-5.6.0/llvm/lib/libomp.so
#5  0x00007ffff5d929b9 in __kmp_reap_thread(kmp_info*, int) () from /opt/rocm-5.6.0/llvm/lib/libomp.so
#6  0x00007ffff5d8f56d in __kmp_internal_end() () from /opt/rocm-5.6.0/llvm/lib/libomp.so
#7  0x00007ffff5d8f3c4 in __kmp_internal_end_library () from /opt/rocm-5.6.0/llvm/lib/libomp.so
#8  0x00007ffff7fc924e in _dl_fini () at ./elf/dl-fini.c:142
#9  0x00007ffff5b28495 in __run_exit_handlers (status=0, listp=0x7ffff5cfc838 <__exit_funcs>,
    run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true) at ./stdlib/exit.c:113
#10 0x00007ffff5b28610 in __GI_exit (status=<optimized out>) at ./stdlib/exit.c:143
#11 0x00007ffff5b0cd97 in __libc_start_call_main (main=main@entry=0x215b10 <main()>, argc=argc@entry=1,
    argv=argv@entry=0x7fffffffe2e8) at ../sysdeps/nptl/libc_start_call_main.h:74
#12 0x00007ffff5b0ce40 in __libc_start_main_impl (main=0x215b10 <main()>, argc=1, argv=0x7fffffffe2e8, init=<optimized out>,
    fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffe2d8) at ../csu/libc-start.c:392
#13 0x00000000002159b5 in _start ()

[Switching to thread 2 (Thread 0x7fffeceb7640 (LWP 17491))]
#0  rocr::core::Runtime::VMFaultHandler (val=0, arg=0x314f60) at /work/git/ROCR-Runtime/src/core/runtime/runtime.cpp:1348
1348        assert(false && "GPU memory access fault.");
(gdb) where
#0  rocr::core::Runtime::VMFaultHandler (val=0, arg=0x314f60) at /work/git/ROCR-Runtime/src/core/runtime/runtime.cpp:1348
#1  0x00007ffff7c9d2f1 in rocr::core::Runtime::AsyncEventsLoop () at /work/git/ROCR-Runtime/src/core/runtime/runtime.cpp:1131
#2  0x00007ffff7bf6c8f in rocr::os::ThreadTrampoline (arg=0x33fa80) at /work/git/ROCR-Runtime/src/core/util/lnx/os_linux.cpp:78
#3  0x00007ffff5b77b43 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#4  0x00007ffff5c08bb4 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:100

(gdb) thread 4
[Switching to thread 4 (Thread 0x7ffde75fd6c0 (LWP 17493))]
#0  malloc_consolidate (av=av@entry=0x7ffff5cfcc80 <main_arena>) at ./malloc/malloc.c:4754
4754    ./malloc/malloc.c: No such file or directory.
(gdb) where
#0  malloc_consolidate (av=av@entry=0x7ffff5cfcc80 <main_arena>) at ./malloc/malloc.c:4754
#1  0x00007ffff5b85f20 in _int_free (av=0x7ffff5cfcc80 <main_arena>, p=0x9cfd80, have_lock=<optimized out>)
    at ./malloc/malloc.c:4674
#2  0x00007ffff5b8872d in __GI___libc_free (mem=<optimized out>) at ./malloc/malloc.c:3391
#3  tcache_thread_shutdown () at ./malloc/malloc.c:3227
#4  __malloc_arena_thread_freeres () at ./malloc/arena.c:1003
#5  0x00007ffff5b8b24a in __libc_thread_freeres () at ./malloc/thread-freeres.c:44
#6  0x00007ffff5b779cf in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:456
#7  0x00007ffff5c08bb4 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:100

Mark and I will continue working on this issue, but we cannot provide an exact timeline for its resolution.

Thanks for understanding.

vlaindic commented 11 months ago

@gcongiu @bertwesarg A brief update. After changing the reproducer to use a single thread that launches kernels sequentially, to create a single queue per device, and to initialize the rocprofiler_properties_t inside the per-device loop, everything seems to work fine on a system with multiple MI210s. However, when more than one thread (created either directly or via OpenMP) launches kernels concurrently, the program may still crash at the very end of the process. @bgopesh and I are still trying to figure out what causes the issue in a multi-threaded environment.
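
For concreteness, a sketch of the working single-threaded variant as described (launch_kernel_on_device is a hypothetical helper; the rest assumes the rocprofiler v1 API, with error handling omitted):

#include <rocprofiler.h>
#include <vector>

void launch_kernel_on_device(size_t device_index);  // hypothetical helper

void profile_sequentially(const std::vector<hsa_agent_t>& gpu_agents,
                          rocprofiler_feature_t* feature) {
  for (size_t i = 0; i < gpu_agents.size(); ++i) {
    rocprofiler_properties_t properties{};  // (re)initialized per device
    properties.queue_depth = 128;

    rocprofiler_t* ctx = nullptr;
    rocprofiler_open(gpu_agents[i], feature, 1, &ctx,
                     ROCPROFILER_MODE_STANDALONE | ROCPROFILER_MODE_CREATEQUEUE,
                     &properties);          // single queue per device

    rocprofiler_start(ctx, 0);
    launch_kernel_on_device(i);             // sequential, main thread only
    rocprofiler_read(ctx, 0);
    rocprofiler_get_data(ctx, 0);
    rocprofiler_get_metrics(ctx);
    rocprofiler_stop(ctx, 0);
    rocprofiler_close(ctx);
  }
}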

gcongiu commented 11 months ago

@vlaindic thank you for the update. In case it helps, I did not see the failure at the end when using ROCm 5.2.0. Maybe the problem was introduced in a later version of the toolkit?

bertwesarg commented 11 months ago

My tests did not yet include launching kernels from multiple threads into the same queue. I can certainly extend my own mini test to do this and run it on our ROCm 5.7.1 RC setup.
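
Such an extension might look like the following (a sketch only, using OpenMP as in the reproducer's sampling.cpp; the kernel is a hypothetical no-op workload): several host threads dispatching into one shared stream on one device.

#include <hip/hip_runtime.h>
#include <omp.h>

__global__ void busy_kernel() {}  // hypothetical no-op workload

void launch_from_many_threads(hipStream_t shared_stream, int num_threads) {
  // Several host threads dispatching into one shared stream/queue.
  #pragma omp parallel num_threads(num_threads)
  {
    hipLaunchKernelGGL(busy_kernel, dim3(1), dim3(64), 0, shared_stream);
  }
  hipStreamSynchronize(shared_stream);  // wait for all dispatches to finish
}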

vlaindic commented 11 months ago

@bertwesarg BTW, 5.7.1 is officially out in case you would like to update your setup. Here is the link.

ppanchad-amd commented 1 month ago

@gcongiu Can you please check if your issue still exists in the latest ROCm 6.2? If resolved, please close the ticket. Thanks!