ROCm / rocprofiler-sdk

https://rocm.docs.amd.com/projects/rocprofiler-sdk/en/latest/
MIT License

[Issue]: Problems with "agent profiling" mode in Rocprofiler-SDK #15

Open adanalis opened 2 months ago

adanalis commented 2 months ago

Problem Description

A) I only get non-zero values for the first event that I have added to the profile.

B) I start two agents for two distinct GPUs and submit my kernel to only one GPU, but I get the same measurements from both agents.

C) When I get the measurements, I have no way of telling which measurement came from which agent.

D) With the watermark set to zero, the buffer callback is triggered as soon as there is one entry in the buffer, before all the entries have been written to it. As a result, we see the entries "out of order." We would like the data to be accessible synchronously when we take a sample, without having to go through buffers.

Operating System

Rocky Linux 9.4 (Blue Onyx)

CPU

AMD EPYC 7413 24-Core Processor

GPU

AMD Instinct MI210

ROCm Version

ROCm 6.2.0

ROCm Component

rocprofiler

Steps to Reproduce

No response

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response

bwelton commented 1 month ago

I suspect A/B may be related. Can you post the code where you call rocprofiler_configure_agent_profile_counting_service?

C is resolved by an internal patch that should be published shortly. D has a patch in the works that should be available soon.

adanalis commented 1 month ago

In addition to the problems discussed above, I'm now getting a segfault inside rocprof-sdk code. I created a PR in the PAPI repo that enables the agent profiling mode and comes with tests. The PR is here: https://github.com/icl-utk-edu/papi/pull/249

To reproduce the segfault, please do the following:

1) Clone PAPI, go into the directory "$papi_root/src", and run ./configure --with-components=rocp_sdk

2) Run make

3) export RPSDK_MODE_AGENT_PROFILE=1

4) Go to $papi_root/src/components/rocp_sdk/tests

5) Run ./advanced
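For convenience, the steps above can be sketched as a single script. This is a rough sketch and not something from the PR itself: the clone URL and the `pull/249/head` refspec are inferred from the PR link, the local branch name and the `PAPI_ROOT` default are arbitrary, and running the test under `gdb` in batch mode is my addition to capture a backtrace (the steps above just run `./advanced`). It also assumes ROCm and rocprofiler-sdk are already installed where PAPI's configure can find them.

```shell
#!/usr/bin/env bash
# Sketch of the reproduction steps. Assumptions are marked inline.

# The function body runs in a subshell, so cd/export/set do not leak
# into the calling shell.
repro_steps() (
    set -euo pipefail

    # Hypothetical checkout location; override with PAPI_ROOT=/some/path.
    papi_root="${PAPI_ROOT:-$PWD/papi}"

    # Assumption: clone URL and PR refspec are inferred from the PR link.
    git clone https://github.com/icl-utk-edu/papi.git "$papi_root"
    cd "$papi_root"
    git fetch origin pull/249/head:agent-profile-pr   # arbitrary local branch name
    git checkout agent-profile-pr

    # Steps 1-2: configure with the rocp_sdk component and build.
    cd "$papi_root/src"
    ./configure --with-components=rocp_sdk
    make

    # Step 3: switch the component to agent profiling mode.
    export RPSDK_MODE_AGENT_PROFILE=1

    # Steps 4-5: run the test. gdb batch mode is an addition here, used
    # only to print a backtrace if the segfault occurs.
    cd "$papi_root/src/components/rocp_sdk/tests"
    gdb -batch -ex run -ex bt ./advanced
)

# Run only when explicitly requested, so sourcing this file has no side effects.
if [ "${1:-}" = "--run" ]; then
    repro_steps
fi
```

Invoke it as `bash repro.sh --run` (the file name is arbitrary). Without `--run` the script only defines the function and does nothing else.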

Here is the backtrace from my runs:

#0 0x00007fffebc3d819 in rocr::HSA::hsa_signal_store_relaxed(hsa_signal_s, long) () from /apps/rocm/rocm-6.3afar6/lib/llvm/bin/../../../lib/libhsa-runtime64.so.1

#1 0x00007fffeb87648e in rocprofiler::counters::read_agent_ctx(rocprofiler::context::context const*, rocprofiler_user_data_t, rocprofiler_counter_flag_t) () from /apps/rocm/rocm-6.3afar6/lib/librocprofiler-sdk.so

#2 0x00000000004a90d7 in papi_rocpsdk::read_sample () at components/rocp_sdk/sdk_class.cpp:632

#3 0x00000000004a9f7d in rocprofiler_sdk_ctx_read (ctx=0xc18410, counters=0x7fffffff6ac8) at components/rocp_sdk/sdk_class.cpp:1110

#4 0x000000000047d23f in _papi_hwi_read (context=, ESI=ESI@entry=0x655710, values=values@entry=0x7fffffff6b90) at papi_internal.c:1713

#5 0x000000000047866c in PAPI_read (EventSet=, values=0x7fffffff6b90) at papi.c:3127

#6 0x0000000000476e4f in main ()