ROCm / rocprofiler-compute

Advanced Profiling and Analytics for AMD Hardware
https://rocm.docs.amd.com/projects/omniperf/en/latest/
MIT License

Errors with CLI analysis #103

Closed gsitaram closed 8 months ago

gsitaram commented 1 year ago

Can this error be worked around?

$ omniperf analyze -p workloads/current/mi200 

--------
Analyze
--------

/opt/omniperf/bin/omniperf_analyze/utils/parser.py:164: RuntimeWarning: invalid value encountered in scalar remainder
  return a % b
Traceback (most recent call last):
  File "/opt/omniperf/bin/omniperf", line 828, in <module>
    main()
  File "/opt/omniperf/bin/omniperf", line 808, in main
    analyze(args)
  File "/opt/omniperf/bin/omniperf_analyze/omniperf_analyze.py", line 284, in analyze
    run_cli(args, runs)
  File "/opt/omniperf/bin/omniperf_analyze/omniperf_analyze.py", line 198, in run_cli
    parser.load_table_data(
  File "/opt/omniperf/bin/omniperf_analyze/utils/parser.py", line 704, in load_table_data
    eval_metric(
  File "/opt/omniperf/bin/omniperf_analyze/utils/parser.py", line 487, in eval_metric
    ammolite__build_in[key] = eval(compile(s, "<string>", "eval"))
  File "<string>", line 2, in <module>
  File "/opt/omniperf/bin/omniperf_analyze/utils/parser.py", line 143, in to_int
    return int(a)
ValueError: cannot convert float NaN to integer

Omniperf version I am using:

$ omniperf --version
----------------------------------------
Omniperf version: 1.0.8-PR1 (release)
Git revision:     ac10ad2
----------------------------------------

gsitaram commented 1 year ago

We saw this error with another workload today. If there is any insight, would be good to have.

coleramos425 commented 1 year ago

Hi Gina. Thank you for reporting this issue. Further investigation of your workload (specifically workloads/current/mi200/pmc_perf.csv) has uncovered multiple dispatches where GRBM_GUI_ACTIVE is 0. I'll have to confer with a hardware expert, but I believe this should always be non-zero.

The issue arises because, when the parser evaluates the metric's Python expression, it attempts a division by a NaN: https://github.com/AMDResearch/omniperf/blob/9770396fa8d75e2d72ead30890cb9d232ff6ea4a/src/omniperf_analyze/utils/parser.py#L481-L492 This snowballs into a larger issue: metrics that use GRBM_GUI_ACTIVE begin to report inf, which can be attributed to known inf issues with Python's eval().
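For anyone following along, here is a minimal sketch of how a zero denominator can turn into the inf and NaN values described above. The counter values are made up for illustration, and the expression only mirrors the shape of the SALU Util formula; it is not the Omniperf parser itself.

```python
import numpy as np
import pandas as pd

# Hypothetical counter values for three dispatches (not taken from the real workload).
# The second and third dispatches report GRBM_GUI_ACTIVE == 0.
pmc_perf = pd.DataFrame(
    {
        "SQ_ACTIVE_INST_SCA": [1200, 800, 0],
        "GRBM_GUI_ACTIVE": [5000, 0, 0],
    }
)
num_cu = 104  # same value as ammolite__numCU in the debug output

# Same shape as the SALU Util expression: 100 * numerator / (denominator * numCU).
with np.errstate(divide="ignore", invalid="ignore"):
    util = (100 * pmc_perf["SQ_ACTIVE_INST_SCA"]) / (pmc_perf["GRBM_GUI_ACTIVE"] * num_cu)

# A nonzero value divided by zero becomes inf; zero divided by zero becomes NaN.
print(util.tolist())

# Averaging skips NaN by default, but a single inf makes the whole average inf.
print(util.mean())

# Converting a NaN to an integer raises the ValueError seen in the traceback.
try:
    int(util.iloc[2])
except ValueError as err:
    print(f"ValueError: {err}")
```

One inf in a per-dispatch series is enough to make the averaged metric inf, and any later integer conversion of a NaN fails outright.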

$ ./src/omniperf analyze -p workloads/current/mi200/ -b 2.1.8 -g

--------
Analyze
--------

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 
raw pmc df info:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17845 entries, 0 to 17844
Columns: 1128 entries, ('SQ_IFETCH_LEVEL', 'Index') to ('pmc_perf', 'CompleteNs')
dtypes: float64(104), int64(1005), object(18), uint64(1)
memory usage: 153.6+ MB
None
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 
filtered pmc df info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17845 entries, 0 to 17844
Columns: 1128 entries, ('SQ_IFETCH_LEVEL', 'Index') to ('pmc_perf', 'CompleteNs')
dtypes: float64(104), int64(1005), object(18), uint64(1)
memory usage: 153.6+ MB
None
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Expression:
Value = 
to_avg(((100 * raw_pmc_df.get('pmc_perf').get("SQ_ACTIVE_INST_SCA")) / (raw_pmc_df.get('pmc_perf').get("GRBM_GUI_ACTIVE") * ammolite__numCU)))

Inputs:
Var  ammolite__numCU : 104

Output:
inf
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Expression:
Peak = 
100

Inputs:

Output:
100
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Expression:
PoP = 
to_avg(((100 * raw_pmc_df.get('pmc_perf').get("SQ_ACTIVE_INST_SCA")) / (raw_pmc_df.get('pmc_perf').get("GRBM_GUI_ACTIVE") * ammolite__numCU)))

Inputs:
Var  ammolite__numCU : 104

Output:
inf
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

--------------------------------------------------------------------------------
2. System Speed-of-Light
╒═════════╤═══════════╤═════════╤════════╤════════╤═══════╕
│ Index   │ Metric    │   Value │ Unit   │   Peak │   PoP │
╞═════════╪═══════════╪═════════╪════════╪════════╪═══════╡
│ 2.1.8   │ SALU Util │     inf │ Pct    │    100 │   inf │
╘═════════╧═══════════╧═════════╧════════╧════════╧═══════╛

Before implementing a full-fledged patch, I'd like to understand why rocprof is reporting these numbers. At the very least, we will update the code to throw a warning if an illogical GRBM_GUI_ACTIVE value is detected.
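A check along these lines could be sketched as follows. This is only an illustration of the idea, not the patch that went into Omniperf: the function name `find_zero_activity_dispatches` is hypothetical, and the frame is assumed to hold one row per dispatch with a GRBM_GUI_ACTIVE column.

```python
import warnings

import pandas as pd


def find_zero_activity_dispatches(pmc_perf: pd.DataFrame,
                                  counter: str = "GRBM_GUI_ACTIVE") -> pd.Index:
    """Return the row labels of dispatches whose counter is zero.

    Hypothetical helper: emits a RuntimeWarning when any such dispatch exists,
    since metrics that divide by this counter would evaluate to inf or NaN.
    """
    bad = pmc_perf.index[pmc_perf[counter] == 0]
    if len(bad) > 0:
        warnings.warn(
            f"{len(bad)} dispatch(es) report {counter} == 0; "
            "metrics that divide by it will be inf or NaN.",
            RuntimeWarning,
        )
    return bad


# Example with made-up values: dispatches 1 and 2 are flagged.
example = pd.DataFrame({"GRBM_GUI_ACTIVE": [5000, 0, 0]})
print(list(find_zero_activity_dispatches(example)))
```

Returning the offending row labels, rather than only warning, lets the caller decide whether to drop those dispatches or stop the analysis.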

PaulMullowney commented 1 year ago

I am seeing this in a situation where I am analyzing the top N kernels, 1 by 1. A small subset of the kernels are showing this error. Why wouldn't I see the error for all kernels?

coleramos425 commented 1 year ago

@PaulMullowney we know this error is triggered when arithmetic encounters a dispatch where (GRBM_GUI_ACTIVE == 0). My guess is some kernels in your workload aren't hitting this condition and when you filter those kernels, the error goes away.
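To see which kernels are affected, one option is to count the zero-valued dispatches per kernel in pmc_perf.csv. The sketch below assumes the CSV has been loaded into a pandas DataFrame and that the kernel name lives in a column called KernelName; both the column name and the helper name are assumptions, so adjust them to match your file.

```python
import pandas as pd


def kernels_with_zero_counter(df: pd.DataFrame,
                              counter: str = "GRBM_GUI_ACTIVE",
                              kernel_col: str = "KernelName") -> pd.DataFrame:
    """Per-kernel count of dispatches where the counter is zero.

    Hypothetical helper; only kernels with at least one zero dispatch are returned.
    """
    flagged = df.assign(_is_zero=df[counter].eq(0))
    summary = flagged.groupby(kernel_col)["_is_zero"].agg(["sum", "count"])
    summary = summary.rename(
        columns={"sum": "zero_dispatches", "count": "total_dispatches"}
    )
    return summary[summary["zero_dispatches"] > 0]


# Made-up data: kernel_a has one bad dispatch out of two, kernel_b is clean.
example = pd.DataFrame(
    {
        "KernelName": ["kernel_a", "kernel_a", "kernel_b", "kernel_c"],
        "GRBM_GUI_ACTIVE": [10, 0, 5, 0],
    }
)
print(kernels_with_zero_counter(example))
```

Kernels missing from the result never hit the zero condition, which would explain why analyzing them one by one succeeds.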

As discussed in Teams chat, we have a few tests planned to help clarify why (GRBM_GUI_ACTIVE == 0) is being reported. One of them will be splitting the pmc_perf.txt input file line by line to rule out any merge issues...

I'll follow up in the next few days after running these tests
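For reference, the line-by-line split could be sketched roughly like this. It assumes the input file uses rocprof's text format, where each counter group is a line starting with `pmc:` and other non-comment lines (such as `gpu:` or `range:`) apply to every run. The helper name and output file naming are hypothetical.

```python
from pathlib import Path


def split_pmc_lines(input_path, out_dir):
    """Write one input file per `pmc:` line, copying the shared settings into each.

    Hypothetical helper; returns the list of files written.
    """
    lines = Path(input_path).read_text().splitlines()
    pmc_lines = [ln for ln in lines if ln.strip().startswith("pmc:")]
    # Non-empty, non-comment lines that are not counter groups are kept in every file.
    shared = [
        ln for ln in lines
        if ln.strip() and not ln.strip().startswith(("pmc:", "#"))
    ]

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for i, pmc in enumerate(pmc_lines):
        target = out / f"pmc_perf_{i}.txt"
        target.write_text("\n".join([pmc, *shared]) + "\n")
        written.append(target)
    return written
```

Each resulting file can then be profiled on its own, so the per-run CSVs can be inspected for zero GRBM_GUI_ACTIVE values before any merge step is involved.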

coleramos425 commented 1 year ago

> One of which will be separating pmc_perf.txt input file line by line to rule out any merge issues... I'll follow up in the next few days after running these tests

Update: Workaround mentioned above is now implemented in dev. Reaching out to Paul to see if this will solve his issue.

coleramos425 commented 1 year ago

We've updated the Omniperf code so that any time this GRBM_GUI_ACTIVE issue occurs, we throw a helpful warning and fail gracefully.

Our custom merge utility didn't fix the original issue, so we've passed the issue to the rocprof team. Awaiting response... https://ontrack-internal.amd.com/browse/SWDEV-402481

Pushing issue to a future milestone

coleramos425 commented 8 months ago

While the underlying issue seems to still be present in rocprofiler: https://ontrack-internal.amd.com/browse/SWDEV-402481

Omniperf will catch the bug and throw a warning via the above commits. Closing issue.