ROCm / rocprofiler

ROC profiler library. Profiling with perf-counters and derived metrics.
https://rocm.docs.amd.com/projects/rocprofiler/en/latest/
Other
116 stars 44 forks source link

PAPI rocm_r component segfaults in intercept mode #74

Closed gcongiu closed 2 years ago

gcongiu commented 2 years ago

Testing PAPI rocm_r component (https://bitbucket.org/congiu/papi/branch/2022.01.11_rocm-rewrite) with the code at this link: https://bitbucket.org/congiu/papi/src/b9533e4c207f20d0477174d097bec2df73867f02/src/components/rocm_r/tests/hip_matmul_single_gpu.cpp

on MI100 GPUs with rocm-4.5.0 and rocm-5.0.0 generates the behaviour following reported.

Following is the kernel running with PAPI rocm_r component in sample mode

$ ./hip_matmul_single_gpu
./hip_matmul_single_gpu : Multiply two square matrices of size 8192 x 8192
First kernel run...
rocm:::SQ_INSTS_VALU:device=0 : 77329334272
rocm:::SQ_INSTS_SALU:device=0 : 17188257792
rocm:::SQ_WAVES:device=0 : 1048576
Second kernel run...
rocm:::SQ_INSTS_VMEM_RD:device=0 : 17179869184
rocm:::SQ_INSTS_VMEM_WR:device=0 : 1048576

And with PAPI rocm_r component in intercept mode

$ ROCP_HSA_INTERCEPT=1 ./hip_matmul_single_gpu
./hip_matmul_single_gpu : Multiply two square matrices of size 8192 x 8192
Segmentation fault (core dumped)

Rerunning the above with gdb:

$ ROCP_HSA_INTERCEPT=1 gdb ./hip_matmul_single_gpu
...Starting program: /home/gcongiu/papi/src/components/rocm_r/tests/./hip_matmul_single_gpu
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
/home/gcongiu/papi/src/components/rocm_r/tests/./hip_matmul_single_gpu : Multiply two square matrices of size 8192 x 8192
[New Thread 0x7fffd303d700 (LWP 68100)]

Program received signal SIGSEGV, Segmentation fault.
0x00000000000127e0 in ?? ()
Missing separate debuginfos, use: debuginfo-install elfutils-libelf-0.176-5.el7.x86_64 glibc-2.17-325.el7_9.x86_64 libdrm-2.4.97-2.el7.x86_64 libgcc-4.8.5-44.el7.x86_64 libstdc++-4.8.5-44.el7.x86_64 ncurses-libs-5.9-14.20130511.el7_4.x86_64 numactl-libs-2.0.12-5.el7.x86_64 zlib-1.2.7-19.el7_9.x86_64
(gdb) bt
#0  0x00000000000127e0 in ?? ()
#1  0x00007fffd30718b0 in ?? () from /opt/rocm-4.5.0/rocprofiler/lib/librocprofiler64.so
#2  0x00007fffd3077d6d in ?? () from /opt/rocm-4.5.0/rocprofiler/lib/librocprofiler64.so
#3  0x00007ffff6885947 in ?? () from /opt/rocm-4.5.0/hip/lib/libamdhip64.so.4
#4  0x00007ffff6899ce5 in ?? () from /opt/rocm-4.5.0/hip/lib/libamdhip64.so.4
#5  0x00007ffff68812ca in ?? () from /opt/rocm-4.5.0/hip/lib/libamdhip64.so.4
#6  0x00007ffff686faa8 in ?? () from /opt/rocm-4.5.0/hip/lib/libamdhip64.so.4
#7  0x00007ffff68143d1 in ?? () from /opt/rocm-4.5.0/hip/lib/libamdhip64.so.4
#8  0x00007ffff6814af8 in ?? () from /opt/rocm-4.5.0/hip/lib/libamdhip64.so.4
#9  0x00007ffff68155da in hipStreamCreate () from /opt/rocm-4.5.0/hip/lib/libamdhip64.so.4
#10 0x0000000000206bce in main ()

Interestingly, if I use MALLOCCHECK=1:

$ MALLOC_CHECK_=1 ROCP_HSA_INTERCEPT=1 ./hip_matmul_single_gpu
./hip_matmul_single_gpu : Multiply two square matrices of size 8192 x 8192
First kernel run...
rocm:::SQ_INSTS_VALU:device=0 : 77329334272
rocm:::SQ_INSTS_SALU:device=0 : 17188257792
rocm:::SQ_WAVES:device=0 : 1048576
Error! Failed starting eventset, error=-8 -> 'Event exists, but cannot be counted due to hardware resource limits’

The segmentation fault disappears. This seems to indicate a memory error in librocprofiler.

Ignore the “Error! …” line. This is generated by PAPI and is due to the fact that the EventSet that initially contained the VALU, SALU and WAVES events has been cleaned up and reused with different events (i.e. VMEM). Since rocprofiler does not allow changing the dispatch callbacks after they have been set PAPI throws an error.

rahulmula commented 2 years ago

Hi I'm trying to build https://bitbucket.org/congiu/papi/src/b9533e4c207f20d0477174d097bec2df73867f02/src/components/rocm_r/tests/hip_matmul_single_gpu.cpp i'm observing below error, pls let me know if you have any instructions on how to build hip_matmul_single_gpu.cpp. Errors: papi/src/components/rocm_r/tests# hipcc -O3 -o test hip_matmul_single_gpu.cpp ld.lld: error: undefined symbol: PAPI_library_init

referenced by hip_matmul_single_gpu.cpp /tmp/hip_matmul_single_gpu-aed288.o:(main)

ld.lld: error: undefined symbol: PAPI_create_eventset

referenced by hip_matmul_single_gpu.cpp /tmp/hip_matmul_single_gpu-aed288.o:(main)

ld.lld: error: undefined symbol: PAPI_add_named_event

referenced by hip_matmul_single_gpu.cpp /tmp/hip_matmul_single_gpu-aed288.o:(main) referenced by hip_matmul_single_gpu.cpp /tmp/hip_matmul_single_gpu-aed288.o:(main) referenced by hip_matmul_single_gpu.cpp /tmp/hip_matmul_single_gpu-aed288.o:(main) referenced 2 more times

ld.lld: error: undefined symbol: PAPI_start

referenced by hip_matmul_single_gpu.cpp /tmp/hip_matmul_single_gpu-aed288.o:(main) referenced by hip_matmul_single_gpu.cpp /tmp/hip_matmul_single_gpu-aed288.o:(main)

ld.lld: error: undefined symbol: PAPI_stop

referenced by hip_matmul_single_gpu.cpp /tmp/hip_matmul_single_gpu-aed288.o:(main) referenced by hip_matmul_single_gpu.cpp /tmp/hip_matmul_single_gpu-aed288.o:(main)

ld.lld: error: undefined symbol: PAPI_cleanup_eventset

referenced by hip_matmul_single_gpu.cpp /tmp/hip_matmul_single_gpu-aed288.o:(main) referenced by hip_matmul_single_gpu.cpp /tmp/hip_matmul_single_gpu-aed288.o:(main)

ld.lld: error: undefined symbol: PAPI_destroy_eventset

referenced by hip_matmul_single_gpu.cpp /tmp/hip_matmul_single_gpu-aed288.o:(main) referenced by hip_matmul_single_gpu.cpp /tmp/hip_matmul_single_gpu-aed288.o:(main)

ld.lld: error: undefined symbol: PAPI_shutdown

referenced by hip_matmul_single_gpu.cpp /tmp/hip_matmul_single_gpu-aed288.o:(main)

ld.lld: error: undefined symbol: PAPI_strerror

referenced by hip_matmul_single_gpu.cpp /tmp/hip_matmul_single_gpu-aed288.o:(main) referenced by hip_matmul_single_gpu.cpp /tmp/hip_matmul_single_gpu-aed288.o:(main) referenced by hip_matmul_single_gpu.cpp /tmp/hip_matmul_single_gpu-aed288.o:(main) referenced 7 more times clang-14: error: linker command failed with exit code 1 (use -v to see invocation)

gcongiu commented 2 years ago

Hi @rahulmula. Sorry for the late reply. Can you try pulling the latest version of the code and try again?

gcongiu commented 2 years ago

@rahulmula you should clone and build papi using the following options:

git clone https://bitbucket.org/congiu/papi.git
cd papi/src
git checkout 2022.01.11_rocm-rewrite
./configure --with-components="rocm" --with-debug=yes && make -j 8

The above will also build the tests in components/rocm/tests. The one you are interested in has been renamed to hip_matmul_single_gpu.cpp.

kikimych commented 2 years ago

Is it ok when PAPI_create_eventset( int *EventSet ) returns null pointer and result status == PAPI_OK?

gcongiu commented 2 years ago

Hi @kikimych, you need to set EventSet = PAPI_NULL before passing it to PAPI_create_eventset. If PAPI returns PAPI_OK you should also get a valid event set handle. If that is not the case can you provide some more information on what you are doing?

ammarwa commented 2 years ago

ROCP_HSA_INTERCEPT=0,1,2

Let me first explain the use of the environment flag, the flag was implemented in order to use the profiler tool usually located in "/opt/rocm/rocprofiler/tool/libtool.so" in order to collect counters for all kernels running and post the output to the terminal screen or optionally in a file.

let me use an example to give more information about its usage:

1- We need first to set some environment variables in order for that feature to work: *if rocprofiler is installed in /opt/rocm then you can replace every "PATHTO" with /opt/rocm export HSA_TOOLS_LIB=PATH_TO_lib/librocprofiler64.so export ROCP_TOOL_LIB=PATH_TO_rocprofiler/tool/libtool.so export ROCP_METRICS=PATH_TO_rocprofiler/lib/metrics.xml export ROCP_INPUT=pmc_input.xml (Example for pmc_input.xml can be found here: https://github.com/ROCm-Developer-Tools/rocprofiler/blob/amd-master/test/tool/pmc_input.xml)

2- Optionally, to store the results in a file: export ROCP_OUTPUT_DIR=./ (That will store the output in a file in the current directory)

3- export ROCP_HSA_INTERCEPT=0 OR 1 OR 2 0 means interception is disabled 1 means interception is enabled without timestamps 2 means interception is enabled with timestamps

4- Finally, you can run the application normally, for example, ./MatrixTranspose

And these steps are the steps taken by the rocprof script to run rocprofiler on any application to collect counters for that specific application.

So, I don't think this is the functionality needed here, since, you are already collecting counters as shown here:

$ ./hip_matmul_single_gpu ./hip_matmul_single_gpu : Multiply two square matrices of size 8192 x 8192 First kernel run... rocm:::SQ_INSTS_VALU:device=0 : 77329334272 rocm:::SQ_INSTS_SALU:device=0 : 17188257792 rocm:::SQ_WAVES:device=0 : 1048576 Second kernel run... rocm:::SQ_INSTS_VMEM_RD:device=0 : 17179869184 rocm:::SQ_INSTS_VMEM_WR:device=0 : 1048576

gcongiu commented 2 years ago

Hi @ammarwa, right. The tool load mechanism in hsa_init is useful when you have an application that does not contain any performance measurement instrumentation but you still want to profile its behaviour. The only application's entry point for rocprofiler to do that is hsa_init, through its tool load mechanism. Thus, the loaded tool defines the metrics and the callbacks for hsa so that this can hook them to the dispatch and completion queues.

In our case, however, we have applications instrumented with PAPI calls (either manually by the user or automatically by tools, e.g. TAU). What we want is let PAPI users access per-kernel information through the intercept mode of rocprofiler. From the PAPI user point of view the way rocprofiler intercept mode is initialized (and works) is irrelevant. The users only know that to get per-kernel performance counters they need to set ROCP_HSA_INTERCEPT=1 and call PAPI_start/PAPI_stop/PAPI_read as they would normally do for sample mode.

kikimych commented 2 years ago

I'm checking value of eventset after https://bitbucket.org/congiu/papi/src/b9533e4c207f20d0477174d097bec2df73867f02/src/components/rocm_r/tests/hip_matmul_single_gpu.cpp#lines-36

(gdb) r Starting program: /home/user/workspace/papi/src/components/rocm_r/tests/hip_matmul_single_gpu [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1". /home/user/workspace/papi/src/components/rocm_r/tests/hip_matmul_single_gpu : Multiply two square matrices of size 8192 x 8192

Breakpoint 1, PAPI_create_eventset (EventSet=0x7fffffffdc6c) at papi.c:2020 2020 { (gdb) finish Run till exit from #0 PAPI_create_eventset (EventSet=0x7fffffffdc6c) at papi.c:2020 0x0000000000217d8d in main (argc=1, argv=0x7fffffffdd98) at hip_matmul_single_gpu.cpp:43 43 papi_errno = PAPI_create_eventset(&eventset); Value returned is $1 = 0 p eventset $2 = 0

Same behaviour with unset value of ROCP_HSA_INTERCEPT

gcongiu commented 2 years ago

@kikimych that is a valid handle value for eventset.

kikimych commented 2 years ago

@kikimych that is a valid handle value for eventset.

I have an issue with restarting program in hipStreamDestroy. That's what i see in debugger: (gdb) p callbacks_ $1 = {dispatch = 0x7ffff5d6d260 <dispatch_callback_cb>, create = 0x7fffffffd9b0, destroy = 0x217bd0 <_start>}

callbacks_.destroy is set to main accidentally

Could you please check this fix https://github.com/ROCm-Developer-Tools/rocprofiler/pull/82 ?

gcongiu commented 2 years ago

@kikimych I have tried your patch. It still segfaults. Following my gdb session output:

Program received signal SIGSEGV, Segmentation fault.
0x000000000001acf0 in ?? ()
Missing separate debuginfos, use: debuginfo-install elfutils-libelf-0.176-5.el7.x86_64 glibc-2.17-325.el7_9.x86_64 libdrm-2.4.97-2.el7.x86_64 ncurses-libs-5.9-14.20130511.el7_4.x86_64 numactl-libs-2.0.12-5.el7.x86_64 zlib-1.2.7-19.el7_9.x86_64
(gdb) bt
#0  0x000000000001acf0 in ?? ()
#1  0x00007fffd2fb0325 in rocprofiler::InterceptQueue::InterceptQueueCreate (agent=..., size=1024, type=0, callback=0x7ffff687fb90,
    data=0x5323f0, private_segment_size=4294967295, group_segment_size=4294967295, queue=0x7fffffffb6f8, tracker_on=@0x7fffffffb6af: false)
    at /home/gcongiu/ROCm/rocprofiler/src/core/intercept_queue.h:129
#2  0x00007fffd2fd8cd9 in rocprofiler::InterceptQueue::QueueCreate (agent=..., size=1024, type=0, callback=0x7ffff687fb90, data=0x5323f0,
    private_segment_size=4294967295, group_segment_size=4294967295, queue=0x7fffffffb6f8)
    at /home/gcongiu/ROCm/rocprofiler/src/core/intercept_queue.h:141
#3  0x00007ffff6884947 in ?? () from /opt/rocm-4.5.0/hip/lib/libamdhip64.so.4
#4  0x00007ffff6898ce5 in ?? () from /opt/rocm-4.5.0/hip/lib/libamdhip64.so.4
#5  0x00007ffff68802ca in ?? () from /opt/rocm-4.5.0/hip/lib/libamdhip64.so.4
#6  0x00007ffff686eaa8 in ?? () from /opt/rocm-4.5.0/hip/lib/libamdhip64.so.4
#7  0x00007ffff68133d1 in ?? () from /opt/rocm-4.5.0/hip/lib/libamdhip64.so.4
#8  0x00007ffff6813af8 in ?? () from /opt/rocm-4.5.0/hip/lib/libamdhip64.so.4
#9  0x00007ffff68145da in hipStreamCreate () from /opt/rocm-4.5.0/hip/lib/libamdhip64.so.4
#10 0x0000000000218433 in main (argc=1, argv=0x7fffffffbe88) at hip_test_single_kernel.cpp:72
(gdb) up
#1  0x00007fffd2fb0325 in rocprofiler::InterceptQueue::InterceptQueueCreate (agent=..., size=1024, type=0, callback=0x7ffff687fb90,
    data=0x5323f0, private_segment_size=4294967295, group_segment_size=4294967295, queue=0x7fffffffb6f8, tracker_on=@0x7fffffffb6af: false)
    at /home/gcongiu/ROCm/rocprofiler/src/core/intercept_queue.h:129
129       status = callbacks_.create(*queue, callback_data_);
(gdb) print callbacks_
$1 = {dispatch = 0x7ffff5f7a35e <dispatch_callback_cb>, create = 0x1acf0, destroy = 0x18}

It looks like callbacks_ is not correctly initialized.

kikimych commented 2 years ago

I have traced it a little bit. So what happens on my machine:

Looks like aslr is disabled somewhere and I have repeating stack segment address from run to run.

Thread 1 "hip_matmul_sing" hit Hardware watchpoint 13: *0x7fffffffd950`

Old value = 3 New value = 2194384 0x00007fffdade6061 in __libc_calloc (n=3, elem_size=8) at malloc.c:3376 3376 malloc.c: No such file or directory. (gdb) bt

0 0x00007fffdade6061 in __libc_calloc (n=3, elem_size=8) at malloc.c:3376

1 0x00007ffff5d6c993 in intercept_ctx_open (ntv_table=0x7ffff5fb1150 , events_id=0x48e160, num_events=3, rocp_ctx=0x486138) at components/rocm/rocp.c:924

2 0x00007ffff5d6bdfe in rocp_ctx_open (ntv_table=0x7ffff5fb1150 , events_id=0x48e160, num_events=3, rocp_ctx=0x486138) at components/rocm/rocp.c:612

3 0x00007ffff5d6a0ae in rocm_start (ctx=0x3bed70, ctl=0x486110) at components/rocm/rocm.c:428

4 0x00007ffff5d05989 in PAPI_start (EventSet=0) at papi.c:2799

5 0x00000000002180f7 in main (argc=1, argv=0x7fffffffdd98) at hip_matmul_single_gpu.cpp:76

Memory address on stack 0x7fffffffd950 is set to _start by standart library calloc function: (gdb) x/1g *0x7fffffffd950 0x217bd0 <_start>: 0x89485ed18949ed31

later in init callback: https://bitbucket.org/congiu/papi/src/b9533e4c207f20d0477174d097bec2df73867f02/src/components/rocm_r/rocp.c#lines-1035

(gdb) x/3g &dispatch_cb 0x7fffffffd940: 0x00000000004fc420 0x00007fffffffd9b0 0x7fffffffd950: 0x0000000000217bd0

dispatch_cb is a structure allocated on stack. You reset value of dispatch_cb.dispatch, but value of dispatch_cb.destroy and dispatch_cb.create fields are inherrited from intercept_ctx_open() frame.

Here: https://github.com/ROCm-Developer-Tools/rocprofiler/blob/amd-master/src/core/intercept_queue.h#L521 this structure is copied to rocm internal intercept queue. When I build papi as is program segfaults on first call of dispatch_cb.create callback, when i initialize dispatch_cb.create but forget to initialize dispatch_cb.destroy program restarts when context pool is destroyed in hipDestroyStream and this repeats in infinite loop.

Could you please initialize all three callback values or set unused ones to NULL and check?

gcongiu commented 2 years ago

Hi @kikimych, I moved the dispatch_cb to global scope and now it works. Thanks for the clarification.

gcongiu commented 2 years ago

@kikimych is this example code also buggy? https://github.com/ROCm-Developer-Tools/rocprofiler/blob/06b07a5e1dd00d1dbbfefa585d1769d8c30f3e28/test/app/intercept_test.cpp#L295

kikimych commented 2 years ago

Brace initialization sets all fields to zero by default, so it's safe.

gcongiu commented 2 years ago

Right but callbacks_ptrs is still allocated on the stack. When that goes out of context the callbacks value will be overwritten by whatever is put on the stack and potentially cause problems. Am I getting this wrong?

kikimych commented 2 years ago

Sure, but callbacks is automatic variable and can't be used after leaving scope, compiler ensures it.

gcongiu commented 2 years ago

Ok then I think this issue can be closed. Thanks a lot for the help @kikimych.