ROCm / rocprofiler

ROC profiler library. Profiling with perf-counters and derived metrics.
https://rocm.docs.amd.com/projects/rocprofiler/en/latest/

multi gpu monitoring counter inconsistencies #92

Closed gcongiu closed 1 year ago

gcongiu commented 2 years ago

I am observing inconsistent counter values when using rocprofiler through the ROCm component in PAPI. The branch used to reproduce the problem is: https://bitbucket.org/icl/papi/pull-requests/251. Here is an example run:

components/rocm/tests/intercept_multi_thread_monitoring : multi GPU activity monitoring program.
[tid:0] rocm:::SQ_INSTS_VALU:device=0 : 82892554240
[tid:0] rocm:::SQ_INSTS_SALU:device=0 : 17257988096
[tid:0] rocm:::SQ_WAVES:device=0 : 262144
[tid:1] rocm:::SQ_INSTS_VALU:device=1 : 82892554240
[tid:1] rocm:::SQ_INSTS_SALU:device=1 : 17257988096
[tid:1] rocm:::SQ_WAVES:device=1 : 262144
PASSED

In intercept mode every kernel runs in isolation, so I would expect the readings on every GPU to be exactly the same, as in the run above. The test is a matrix-matrix multiplication kernel. Each matrix has 4096 rows and 4096 columns. Running on MI250X (which has a wave size of 64), this means 262144 waves in total for each device (as each device runs the same matmul kernel). However, running the same test again I get the following:
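For reference, the expected wave count can be reproduced with a few lines of arithmetic. This is only a sketch: it assumes the kernel launches one work-item per output element and takes both matrix dimensions as 4096, which is what the reported 262144 implies. The function name `expected_waves` is made up for illustration and is not part of PAPI or rocprofiler.

```python
def expected_waves(rows: int, cols: int, wave_size: int = 64) -> int:
    """Waves needed when each output element gets one work-item (assumption)."""
    work_items = rows * cols
    # Round up in case the work-item count is not a multiple of the wave size.
    return (work_items + wave_size - 1) // wave_size


# 4096 * 4096 = 16777216 work-items, divided into waves of 64.
print(expected_waves(4096, 4096))  # 262144
```

Under the same assumption, the second run's SQ_WAVES reading of 262292 on device 1 is 148 waves above this figure.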

components/rocm/tests/intercept_multi_thread_monitoring : multi GPU activity monitoring program.
[tid:1] rocm:::SQ_INSTS_VALU:device=1 : 82892574368
[tid:1] rocm:::SQ_INSTS_SALU:device=1 : 17258050404
[tid:1] rocm:::SQ_WAVES:device=1 : 262292
FAILED!!!
Line # 148 Error in match_expected_counter: Invalid argument
Some tests require special hardware, permissions, OS, compilers
or library versions. PAPI may still function perfectly on your
system without the particular feature being tested here.
[tid:0] rocm:::SQ_INSTS_VALU:device=0 : 82892554240
[tid:0] rocm:::SQ_INSTS_SALU:device=0 : 17257988096
[tid:0] rocm:::SQ_WAVES:device=0 : 262144
srun: error: crusher104: task 0: Exited with exit code 1
srun: launch/slurm: _step_signal: Terminating StepId=106680.0

In the second run, device 1 reports different values for all three counters, each slightly higher than in the first run.

The code I am using for the tests above is: https://bitbucket.org/congiu/papi/src/c2fd41dbbc36/src/components/rocm/tests/multi_thread_monitoring.cpp

gcongiu commented 1 year ago

This is actually not an issue. Waves can be scheduled out before they complete and then scheduled back in. When that happens, the recorded number of waves is higher than the expected number.
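Given that explanation, a test that compares counters for exact equality will fail intermittently. One possible way to check the readings is to accept a small excess over the expected value. The sketch below is not the `match_expected_counter` logic from the PAPI test; the helper `within_tolerance` and the 0.1% threshold are assumptions chosen for illustration, and it applies the same rule to all three counters even though the explanation above only mentions waves. The numbers are copied from the two runs reported in this issue.

```python
# Values from the first run (taken here as the expected baseline).
EXPECTED = {
    "SQ_INSTS_VALU": 82892554240,
    "SQ_INSTS_SALU": 17257988096,
    "SQ_WAVES": 262144,
}

# Values reported for device 1 in the second run.
SECOND_RUN_DEV1 = {
    "SQ_INSTS_VALU": 82892574368,
    "SQ_INSTS_SALU": 17258050404,
    "SQ_WAVES": 262292,
}


def within_tolerance(measured: int, expected: int, rel_tol: float = 1e-3) -> bool:
    """Accept readings at or slightly above the expected value.

    Assumption: rescheduled waves can only inflate a count, never reduce it,
    so readings below the expected value are rejected.
    """
    excess = measured - expected
    return 0 <= excess <= expected * rel_tol


for name, expected in EXPECTED.items():
    measured = SECOND_RUN_DEV1[name]
    status = "ok" if within_tolerance(measured, expected) else "mismatch"
    print(f"{name}: excess={measured - expected} {status}")
```

With these inputs every counter from the second run falls within the 0.1% band, while an exact comparison (`rel_tol=0`) would reject all three.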