Closed benrichard-amd closed 3 months ago
I reviewed the PR and in general, things look good. The one thing that I wanted to review closely was the Omniperf metrics that are explicitly dependent on SQ_*.csv (indicated by the coll_level
field in yaml configs) to ensure results are still matching original implementation, i.e.
See: Code search results (github.com)
Things line up for the most part however when contrasting the two I find that INSTR_FETCH_LATENCY
and SMEM_LATENCY
differ. It seems to me that counter <-> output file mapping is staying the same in which case I think we can narrow this down to run-to-run variation. @benrichard-amd could you confirm my assumption? My results are below:
omniperf analyze -p workloads/orig_run1/MI300A/ -p workloads/orig_run2/MI300A/ -b 11.2.10 13.2.5
omniperf analyze -p workloads/orig_run1/MI300A/ -p workloads/bens_run1/MI300A/ -b 11.2.10 13.2.5
Other than the above inquiry my only other request before we merge this is that you sign off on commits per DCO requirements, e.g.
Hi @coleramos425,
Things line up for the most part however when contrasting the two I find that INSTR_FETCH_LATENCY and SMEM_LATENCY differ. It seems to me that counter <-> output file mapping is staying the same in which case I think we can narrow this down to run-to-run variation. @benrichard-amd could you confirm my assumption? My results are below:
I think this is run-to-run variation. Might depend on the workload. I ran the same experiment using the occupancy.hip sample workload (MI300X):
original code vs original code:
original code vs ben's modification:
LGTM
For the record, in the event of a vanilla Omniperf profiling run (e.g. no IP block filtering) this PR reduces the num. of required application replays from 24 -> 15. In the study of mixbench profiling performance (below), I found this leads to a ~27% improvement.
Note: This test was ran using a production rocprofiler build. We can expect an even larger improvement when this is applied in combination with profiler performance enhancements in a future release.
Runtime (sec) | # of req. application replays | |
---|---|---|
Original code | 32.28 | 24 |
This PR | 23.67 | 15 |
CC: @koomie
Reduces number of calls to rocprof by improving perfmon coalescing.