ROCm / rocprofiler

ROC profiler library. Profiling with perf-counters and derived metrics.
https://rocm.docs.amd.com/projects/rocprofiler/en/latest/
MIT License

ROCr error when running papi_command_line with specific event #132

Open gcongiu opened 10 months ago

gcongiu commented 10 months ago

I am observing a memory access fault when running papi_command_line with the event rocm:::TA_BUSY_avr:device=0 on ROCm 5.7.1:

$ rocgdb
GNU gdb (rocm-rel-5.7-98) 13.2
Copyright (C) 2023 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://github.com/ROCm-Developer-Tools/ROCgdb/issues>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word".
(gdb) file utils/papi_command_line
Reading symbols from utils/papi_command_line...
(gdb) set args rocm:::TA_BUSY_avr:device=0
(gdb) r
This utility lets you add events from the command line interface to see if they work.

[New Thread 0x7ffff7617640 (LWP 4121665)]
[New Thread 0x7ffef6c4c640 (LWP 4121666)]
[Thread 0x7ffef6c4c640 (LWP 4121666) exited]
Successfully added: rocm:::TA_BUSY_avr:device=0

rocm:::TA_BUSY_avr:device=0 :   0

----------------------------------
Memory access fault by GPU node-2 (Agent handle: 0x472b7f0) on address 0x7ffef63ee000. Reason: Unknown.

Thread 2 received signal SIGABRT, Aborted.
[Switching to thread 2 (Thread 0x7ffff7617640 (LWP 4121665))]
0x00007ffff7e4b54c in __pthread_kill_implementation () from /lib64/libc.so.6

(gdb) bt
#0  0x00007ffff7e4b54c in __pthread_kill_implementation () from /lib64/libc.so.6
#1  0x00007ffff7dfed46 in raise () from /lib64/libc.so.6
#2  0x00007ffff7dd27f3 in abort () from /lib64/libc.so.6
#3  0x00007ffff7aad689 in rocr::core::Runtime::VMFaultHandler(long, void*) [clone .cold] () from /opt/rocm-5.7.1/lib/libhsa-runtime64.so
#4  0x00007ffff7af734c in rocr::core::Runtime::AsyncEventsLoop(void*) () from /opt/rocm-5.7.1/lib/libhsa-runtime64.so
#5  0x00007ffff7ab1237 in rocr::os::ThreadTrampoline(void*) () from /opt/rocm-5.7.1/lib/libhsa-runtime64.so
#6  0x00007ffff7e49802 in start_thread () from /lib64/libc.so.6
#7  0x00007ffff7de9450 in clone3 () from /lib64/libc.so.6

harkgill-amd commented 1 month ago

Hi @gcongiu, are you still encountering this issue with the latest ROCm 6.2 release? If so, could you please provide more information regarding the error you are seeing?

The steps to reproduce the issue and also a minimal reproducible example would help us further investigate this issue on our side. Thanks!