ROCm / rocprofiler

ROC profiler library. Profiling with perf-counters and derived metrics.
https://rocm.docs.amd.com/projects/rocprofiler/en/latest/
MIT License

[Issue]: MfmaUtil from derived metric seems incorrect #147

Closed stephen-youn closed 3 weeks ago

stephen-youn commented 1 month ago

Problem Description

I am trying to get the tensor core (MFMA on MI300X) utilization from the hardware counters in rocprof, using the TENSOR_ACTIVE / MfmaUtil derived metric from https://github.com/ROCm/rocprofiler/blob/7fa8139944668a80e94f45e97eda959e33474297/src/core/counters/derived/metrics.xml#L399. For a float16 matmul of [8192, 8192] x [8192, 8192], the metric above reports 11%, while the FLOPS measured by the 2*M*N*K/measured_latency formula was 603 TFLOPS, i.e. 46% of the 1.3 PFLOPS peak of MI300X. So it's 11% vs 46%, and I wonder how to interpret this 11% number. (I already changed CU_NUM from 320 to 304 when computing TENSOR_ACTIVE, since only 38 CUs per XCD are active on MI300X.)

```
MfmaUtil expr=100*SQ_VALU_MFMA_BUSY_CYCLES/(GRBM_GUI_ACTIVE*CU_NUM*4)
```

Could it be that (1) CU_NUM has to be 304 (not 320) and (2) the factor of 4 should be dropped, i.e. MfmaUtil expr=100*SQ_VALU_MFMA_BUSY_CYCLES/(GRBM_GUI_ACTIVE*304)? Then it's 44%, which is close to the 46% from measured latency.
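
For concreteness, a minimal sketch of the two variants of the calculation (the counter values are placeholders, not measurements):

```python
# Sketch comparing the stock MfmaUtil expression with the proposed one.
# The two counter values are placeholders; substitute the values rocprof
# reports for a single kernel dispatch.
SQ_VALU_MFMA_BUSY_CYCLES = 1.0e9  # placeholder
GRBM_GUI_ACTIVE = 2.0e9           # placeholder

# Stock definition from metrics.xml (CU_NUM = 320, times 4 SIMDs per CU):
mfma_util_stock = 100 * SQ_VALU_MFMA_BUSY_CYCLES / (GRBM_GUI_ACTIVE * 320 * 4)

# Proposed variant: 304 active CUs on MI300X, factor of 4 dropped:
mfma_util_proposed = 100 * SQ_VALU_MFMA_BUSY_CYCLES / (GRBM_GUI_ACTIVE * 304)

print(f"stock: {mfma_util_stock:.2f}%  proposed: {mfma_util_proposed:.2f}%")
```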

Operating System

5.15.0-116-generic #126-Ubuntu SMP

CPU

AMD EPYC 9654 96-Core Processor

GPU

AMD Instinct MI300X

ROCm Version

ROCm 6.2.1

ROCm Component

rocprofiler

Steps to Reproduce

Run the following torch code with rocprofiler:

test.py:

```python
import torch

dtype = torch.float16  # float16, per the problem description
M, N, K = 8192, 8192, 8192
a = torch.randn((K, N), device='cuda', dtype=dtype)
b = torch.randn((M, K), device='cuda', dtype=dtype)
torch.matmul(a, b)
```

input.txt:

```
pmc: SQ_WAVES GRBM_COUNT GRBM_GUI_ACTIVE SQ_INSTS_VALU SQ_INSTS_VALU_MFMA_F16 SQ_INSTS_MFMA SQ_VALU_MFMA_BUSY_CYCLES GRBM_GUI_ACTIVE CU_NUM SQ_INSTS_VALU_MFMA_F32 SQ_BUSY_CU_CYCLES
```

```
rocprofv3 --kernel-trace -i input.txt -o counter1.txt -- python test.py
```

From the CSV files produced by rocprof, compute TFLOPS from kernel.csv and TENSOR_ACTIVE from counter_collection.csv, for example as in the sketch below.
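
A minimal post-processing sketch for the TFLOPS half of that step. The output filename, the timestamp column names, and the nanosecond timestamp unit are assumptions; adjust them to what your rocprofv3 version actually emits:

```python
# Sketch: achieved TFLOPS from the rocprof kernel trace.
import pandas as pd

M = N = K = 8192
df = pd.read_csv("counter1_kernel_trace.csv")  # assumed output filename
dur_ns = df["End_Timestamp"] - df["Start_Timestamp"]  # assumed column names
latency_s = dur_ns.max() * 1e-9  # take the GEMM kernel (longest dispatch)

tflops = 2 * M * N * K / latency_s / 1e12
print(f"{tflops:.0f} TFLOPS = {100 * tflops / 1307.4:.0f}% of FP16 peak")
```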

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response

ppanchad-amd commented 1 month ago

Hi @stephen-youn. Internal ticket has been created to assist with your issue. Thanks!

stephen-youn commented 1 month ago

I ran extra experiments to figure out the relationship between the measured utilization, which is 2*M*N*K / measured_latency / peak_flops, and MfmaUtil, which is SQ_VALU_MFMA_BUSY_CYCLES / GRBM_GUI_ACTIVE / CU_NUM. I expected at least a linear fit to work, but the relationship does not even appear to be linear: MfmaUtil tends to underestimate the utilization, especially when the kernel is dominated by memory bandwidth (e.g. skinny matmuls) rather than by compute.

[Figure 1: fit of x = "MfmaUtil" to y = "measured utilization". The GEMMs ran on MI300X over 157 different shapes with different transpose params (NN, NT, TN, TT); the shapes were taken from actual Llama 2 and Mixtral model training runs.]

stephen-youn commented 1 month ago

With the original definition of MfmaUtil in rocprofiler, which is SQ_VALU_MFMA_BUSY_CYCLES / GRBM_GUI_ACTIVE / 320 / 4 for MI300X, the gap is even larger.

[Figure 2: the same fit using the stock MfmaUtil definition.]

jamesxu2 commented 1 month ago

Hi @stephen-youn,

Thanks for providing a reproducer. I've run this on MI210 and collected some values per your calculations.

which is 46% use of the 1.3Pflops peak flops in mi300x

Where does this 1.3 PFLOPS number come from? One source I see is Wikipedia, which quotes ~1.3 PFLOPS peak FP16, but I do not know under what conditions this value was measured. A variation in GPU clock speeds might change the measured latency, for instance; the different XCDs on the MI300X exhibit different clock speeds. There are many factors that can cause a deviation between the absolute peak value and your measured value, as covered in the MI300X tuning documentation.

a factor of 4 should be gone [...]

I believe this is to normalize to the SIMD count, since the SIMDs contain the actual MFMA units/tensor cores and there are 4 SIMDs per CU on MI300X. If you have a look at the CDNA3 whitepaper, you can see that while there are 304 CUs, there are 1216 matrix cores (ref. p25 of the above whitepaper); the arithmetic is restated below.
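
A worked restatement of that factor, with numbers from the whitepaper:

```python
# The factor of 4 in the MfmaUtil denominator normalizes per SIMD:
# each CU has 4 SIMDs, and each SIMD has its own matrix core.
active_cus = 304          # 38 CUs per XCD x 8 XCDs on MI300X
simds_per_cu = 4
print(active_cus * simds_per_cu)  # 1216 matrix cores, matching p25
```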

As to the nonlinear relationship between MfmaUtil and "measured utilization", I don't really have an explanation for that. It's possible there are some nonlinear effects on FLOPS at scale. I will poke around internally for more information.

stephen-youn commented 3 weeks ago

The reference 1.3 PFLOPS (peak FLOPS) is from the spec, which gives 1307.4 TFLOPS: https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/white-papers/amd-cdna-3-white-paper.pdf
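
As a sanity check, that spec number is consistent with 1216 matrix cores at a 2.1 GHz peak engine clock, assuming 512 dense FP16 FLOPs per matrix core per clock (the per-core rate here is inferred by back-dividing the spec number, so treat it as an assumption):

```python
# 1216 matrix cores x 512 FP16 FLOPs/core/clock x 2.1 GHz peak clock.
matrix_cores = 1216
flops_per_core_per_clock = 512   # inferred: 1307.4e12 / (1216 * 2.1e9)
peak_clock_hz = 2.1e9
print(matrix_cores * flops_per_core_per_clock * peak_clock_hz / 1e12)  # ~1307.4
```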

jamesxu2 commented 3 weeks ago

Resolved this issue internally, but for others who are interested in measuring the practical peak FLOPs on their machine, I recommend using the roofline analysis tools in rocprof-compute.