Closed gsitaram closed 8 months ago
We saw this error with another workload today. Any insight would be good to have.
Hi Gina. Thank you for reporting this issue. Further investigation of your workload (specifically _workloads/current/mi200/pmcperf.csv) has uncovered multiple dispatches where GRBM_GUI_ACTIVE is 0. I'll have to confer with a hardware expert, but I believe this should always be non-zero.
This issue arises because evaluating the Python expression attempts a division by zero (GRBM_GUI_ACTIVE is the denominator).
https://github.com/AMDResearch/omniperf/blob/9770396fa8d75e2d72ead30890cb9d232ff6ea4a/src/omniperf_analyze/utils/parser.py#L481-L492
This snowballs into a larger issue: metrics that use GRBM_GUI_ACTIVE begin to report inf, which can be attributed to Python eval()'s known inf issues.
$ ./src/omniperf analyze -p workloads/current/mi200/ -b 2.1.8 -g
--------
Analyze
--------
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
raw pmc df info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17845 entries, 0 to 17844
Columns: 1128 entries, ('SQ_IFETCH_LEVEL', 'Index') to ('pmc_perf', 'CompleteNs')
dtypes: float64(104), int64(1005), object(18), uint64(1)
memory usage: 153.6+ MB
None
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
filtered pmc df info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17845 entries, 0 to 17844
Columns: 1128 entries, ('SQ_IFETCH_LEVEL', 'Index') to ('pmc_perf', 'CompleteNs')
dtypes: float64(104), int64(1005), object(18), uint64(1)
memory usage: 153.6+ MB
None
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Expression:
Value =
to_avg(((100 * raw_pmc_df.get('pmc_perf').get("SQ_ACTIVE_INST_SCA")) / (raw_pmc_df.get('pmc_perf').get("GRBM_GUI_ACTIVE") * ammolite__numCU)))
Inputs:
Var ammolite__numCU : 104
Output:
inf
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Expression:
Peak =
100
Inputs:
Output:
100
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Expression:
PoP =
to_avg(((100 * raw_pmc_df.get('pmc_perf').get("SQ_ACTIVE_INST_SCA")) / (raw_pmc_df.get('pmc_perf').get("GRBM_GUI_ACTIVE") * ammolite__numCU)))
Inputs:
Var ammolite__numCU : 104
Output:
inf
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
--------------------------------------------------------------------------------
2. System Speed-of-Light
╒═════════╤═══════════╤═════════╤════════╤════════╤═══════╕
│ Index │ Metric │ Value │ Unit │ Peak │ PoP │
╞═════════╪═══════════╪═════════╪════════╪════════╪═══════╡
│ 2.1.8 │ SALU Util │ inf │ Pct │ 100 │ inf │
╘═════════╧═══════════╧═════════╧════════╧════════╧═══════╛
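The inf in the table above can be reproduced outside Omniperf. The snippet below is a minimal sketch, not the Omniperf implementation: the counter values are made up, and to_avg is assumed here to behave like a plain mean. It shows that a single dispatch with GRBM_GUI_ACTIVE == 0 is enough to turn the averaged metric into inf, because pandas returns inf for a non-zero value divided by zero instead of raising.

```python
import numpy as np
import pandas as pd

# Hypothetical counter values for three dispatches (not taken from the
# reported workload). The second dispatch has GRBM_GUI_ACTIVE == 0.
pmc_perf = pd.DataFrame(
    {
        "SQ_ACTIVE_INST_SCA": [5200, 1300, 2600],
        "GRBM_GUI_ACTIVE": [1000, 0, 500],
    }
)
num_cu = 104  # same value as ammolite__numCU in the log above

# Same arithmetic as the logged expression, evaluated per dispatch.
per_dispatch = (100 * pmc_perf["SQ_ACTIVE_INST_SCA"]) / (
    pmc_perf["GRBM_GUI_ACTIVE"] * num_cu
)

# Assumption: to_avg is approximated by a plain mean over dispatches.
salu_util = per_dispatch.mean()

print(per_dispatch.tolist())
print(salu_util)
```

The two healthy dispatches each evaluate to a finite percentage, but the mean is inf as soon as one row divides by zero.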
Before implementing a full-fledged patch, I'd like to understand why rocprof is reporting these numbers. At the very least, we will update the code to throw a warning if an illogical GRBM_GUI_ACTIVE value is detected.
I am seeing this in a situation where I am analyzing the top N kernels, 1 by 1. A small subset of the kernels are showing this error. Why wouldn't I see the error for all kernels?
@PaulMullowney we know this error is triggered when arithmetic encounters a dispatch where (GRBM_GUI_ACTIVE == 0). My guess is that some kernels in your workload never hit this condition, so when you filter down to those kernels, the error goes away.
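One way to check this guess is to count, per kernel, how many dispatches report GRBM_GUI_ACTIVE == 0. The following is a rough sketch using only the standard library; the sample data is invented, and the column names KernelName and GRBM_GUI_ACTIVE are assumptions about the CSV layout, so adjust them to match the actual file.

```python
import csv
import io
from collections import Counter

# Hypothetical stand-in for the profiler CSV; column names are assumptions.
SAMPLE = """KernelName,GRBM_GUI_ACTIVE
kernel_a,1200
kernel_a,0
kernel_b,900
kernel_b,950
kernel_c,0
"""


def zero_active_by_kernel(csv_file):
    """Count dispatches per kernel where GRBM_GUI_ACTIVE is reported as 0."""
    zeros = Counter()
    for row in csv.DictReader(csv_file):
        if float(row["GRBM_GUI_ACTIVE"]) == 0:
            zeros[row["KernelName"]] += 1
    return dict(zeros)


# For a real file, pass an open file handle instead of the StringIO sample.
affected = zero_active_by_kernel(io.StringIO(SAMPLE))
print(affected)
```

Kernels that do not appear in the result have no zero-valued dispatches, which would explain why analyzing them one at a time does not produce inf.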
As discussed in Teams chat, we have a few tests planned to help clarify why (GRBM_GUI_ACTIVE == 0) is being reported. One of these will be separating the pmc_perf.txt input file line by line to rule out any merge issues...
I'll follow up in the next few days after running these tests.
Update:
Workaround mentioned above is now implemented in dev. Reaching out to Paul to see if this will solve his issue.
We've updated the Omniperf code so that any time this GRBM_GUI_ACTIVE issue occurs, we throw a helpful warning and fail gracefully.
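For readers who want a feel for what such a guard might look like, here is a minimal sketch. It is not the code from the Omniperf commits: the function name guarded_salu_util, the warning text, and the choice to return None are all hypothetical, and the average is assumed to be a plain mean.

```python
import warnings


def guarded_salu_util(sq_active_inst_sca, grbm_gui_active, num_cu):
    """Average 100 * SQ_ACTIVE_INST_SCA / (GRBM_GUI_ACTIVE * num_cu).

    Hypothetical guard: if any dispatch reports GRBM_GUI_ACTIVE == 0,
    emit a warning and return None instead of propagating inf.
    """
    if any(active == 0 for active in grbm_gui_active):
        warnings.warn(
            "Detected GRBM_GUI_ACTIVE == 0 in at least one dispatch; "
            "skipping metrics that depend on it.",
            RuntimeWarning,
        )
        return None
    ratios = [
        100 * sca / (active * num_cu)
        for sca, active in zip(sq_active_inst_sca, grbm_gui_active)
    ]
    return sum(ratios) / len(ratios)


healthy = guarded_salu_util([5200, 2600], [1000, 500], 104)
print(healthy)
```

Returning None rather than inf lets the caller decide how to display the missing metric, which is one possible reading of "fail gracefully".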
Our custom merge utility didn't fix the original issue, so we've passed the issue to the rocprof team. Awaiting response... https://ontrack-internal.amd.com/browse/SWDEV-402481
Pushing issue to a future milestone
While the underlying issue seems to still be present in rocprofiler: https://ontrack-internal.amd.com/browse/SWDEV-402481
Omniperf will catch the bug and throw a warning via the above commits. Closing issue.
Can this error be worked around?
Omniperf version I am using: