ROCm / rocprofiler-compute

Advanced Profiling and Analytics for AMD Hardware
https://rocm.docs.amd.com/projects/omniperf/en/latest/
MIT License
135 stars 49 forks source link

Reduce number of calls to rocprof #384

Closed benrichard-amd closed 3 months ago

benrichard-amd commented 3 months ago

Reduces number of calls to rocprof by improving perfmon coalescing.

coleramos425 commented 3 months ago

I reviewed the PR and in general, things look good. The one thing that I wanted to review closely was the Omniperf metrics that are explicitly dependent on SQ_*.csv (indicated by the coll_level field in yaml configs) to ensure results are still matching original implementation, i.e. image See: Code search results (github.com)

Things line up for the most part however when contrasting the two I find that INSTR_FETCH_LATENCY and SMEM_LATENCY differ. It seems to me that counter <-> output file mapping is staying the same in which case I think we can narrow this down to run-to-run variation. @benrichard-amd could you confirm my assumption? My results are below:

v1 (original code vs. original code)

omniperf analyze -p workloads/orig_run1/MI300A/ -p workloads/orig_run2/MI300A/ -b 11.2.10 13.2.5

v1_v1

v2 (original code vs. ben's modification)

omniperf analyze -p workloads/orig_run1/MI300A/ -p workloads/bens_run1/MI300A/ -b 11.2.10 13.2.5

v1_v2

coleramos425 commented 3 months ago

Other than the above inquiry my only other request before we merge this is that you sign off on commits per DCO requirements, e.g. image

benrichard-amd commented 3 months ago

Hi @coleramos425,

Things line up for the most part however when contrasting the two I find that INSTR_FETCH_LATENCY and SMEM_LATENCY differ. It seems to me that counter <-> output file mapping is staying the same in which case I think we can narrow this down to run-to-run variation. @benrichard-amd could you confirm my assumption? My results are below:

I think this is run-to-run variation. Might depend on the workload. I ran the same experiment using the occupancy.hip sample workload (MI300X):

original code vs original code: occupancy_origin

original code vs ben's modification: occupancy_ben

coleramos425 commented 3 months ago

LGTM

coleramos425 commented 3 months ago

For the record, in the event of a vanilla Omniperf profiling run (e.g. no IP block filtering) this PR reduces the num. of required application replays from 24 -> 15. In the study of mixbench profiling performance (below), I found this leads to a ~27% improvement.

Note: This test was ran using a production rocprofiler build. We can expect an even larger improvement when this is applied in combination with profiler performance enhancements in a future release.

Runtime (sec) # of req. application replays
Original code 32.28 24
This PR 23.67 15

CC: @koomie