stephen-youn opened 1 week ago
Hi @stephen-youn. Internal ticket has been created to assist with your issue. Thanks!
I ran extra experiments to figure out the relationship between the measured utilization, which is 2*M*N*K / measured_latency / peak_flops, and MfmaUtil, which is SQ_VALU_MFMA_BUSY_CYCLES / GRBM_GUI_ACTIVE / cu_num. I expected at least a linear fit to work, but the relationship does not even appear to be linear. MfmaUtil, the tensor core utilization number from rocprofiler's counters, tends to underestimate the utilization, especially when the kernel is dominated by memory bandwidth (like a skinny matmul) rather than by compute.

Figure 1: Fit x="MfmaUtil" to y="measured utilization". The GEMMs ran on MI300X over 157 different shapes with different transpose params (NN, NT, TN, TT); the matmul shapes come from actual Llama 2 and Mixtral model training runs.

With the original definition of MfmaUtil in rocprofiler, which is SQ_VALU_MFMA_BUSY_CYCLES / GRBM_GUI_ACTIVE / 320 / 4 for MI300X, the gap is even larger.
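For reference, a minimal sketch of the two definitions in Python (function and variable names are just for illustration):

# Sketch of the two utilization definitions being compared
PEAK_FLOPS_FP16 = 1307.4e12   # MI300X peak FP16 matrix FLOPS per the CDNA3 whitepaper
CU_NUM = 304                  # active CUs on MI300X (38 per XCD x 8 XCDs)

def measured_utilization(M, N, K, latency_s):
    # achieved FLOPS over peak FLOPS: 2*M*N*K / measured_latency / peak_flops
    return 2 * M * N * K / latency_s / PEAK_FLOPS_FP16

def mfma_util(sq_valu_mfma_busy_cycles, grbm_gui_active, cu_num=CU_NUM, simd_factor=1):
    # MfmaUtil as used in this comparison; the original rocprofiler expression
    # uses cu_num=320 and simd_factor=4 instead
    return sq_valu_mfma_busy_cycles / (grbm_gui_active * cu_num * simd_factor)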
Hi @stephen-youn,
Thanks for providing a reproducer. I've run this on MI210 and collected some values per your calculations.
"which is 46% use of the 1.3Pflops peak flops in mi300x"
Where does this 1.3PFlops number come from? I see one source being Wikipedia which quotes a ~1.3Pflops fp16 peak flops, but I do not know under what conditions this value was measured. A variation in the GPU clock speeds might change the measured latency for instance - the different XCDs on the MI300X exhibit different clock speeds. There are many factors that might cause a deviation between the absolute peak value and your measured value as you can find in the MI300X tuning documentation.
"a factor of 4 should be gone [...]"
I believe this is to normalize to the SIMD count; the SIMDs contain the actual MFMA units/tensor cores. There are 4 SIMDs per CU on MI300X. If you have a look at the CDNA3 whitepaper, you can see that while there are 304 CUs, there are 1216 matrix cores (ref. p. 25 of the above whitepaper).
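In other words, the factor of 4 makes the denominator count per-SIMD (per matrix core) cycles rather than per-CU cycles; a quick check against the whitepaper numbers:

cu_num = 304                   # CUs on MI300X
simds_per_cu = 4               # one matrix core per SIMD
print(cu_num * simds_per_cu)   # 1216, matching the matrix core count in the whitepaper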
As to the nonlinear relationship between the MFMAUtil and "measured utilization", I don't really have an explanation for that. It's possible there are some nonlinear effects on FLOPs with scale. I will poke around internally for some more information.
The reference 1.3 PFLOPS (peak flops) is from the spec, which is 1307.4 TFLOPS, from https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/white-papers/amd-cdna-3-white-paper.pdf
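For completeness, the spec value can be roughly reconstructed from the peak engine clock and the per-CU FP16 matrix throughput; the two inputs below are my assumptions taken from the public MI300X specifications, not from this thread:

# Rough reconstruction of the 1307.4 TFLOPS FP16 peak (assumed inputs)
cu_num = 304
fp16_flops_per_cu_per_clk = 2048   # assumed per-CU FP16 matrix throughput per clock
peak_clock_hz = 2.1e9              # assumed 2100 MHz peak engine clock
print(cu_num * fp16_flops_per_cu_per_clk * peak_clock_hz / 1e12)   # ~1307.4 TFLOPS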
Problem Description
I am trying to get the tensor core (MFMA on MI300X) utilization from the hardware counters in rocprof, so I am trying to use the TENSOR_ACTIVE / MfmaUtil derived metric from https://github.com/ROCm/rocprofiler/blob/7fa8139944668a80e94f45e97eda959e33474297/src/core/counters/derived/metrics.xml#L399. But the number reported for multiplying float16 matrices [8192, 8192] x [8192, 8192] is 11% with the above metric, while the FLOPS measured by the
2*M*N*K/measured_latency
formula was 603 TFLOPS, which is 46% of the 1.3 PFLOPS peak flops on MI300X. So it's 11% vs 46%, and I wonder how to interpret this 11% number. (I already changed CU_NUM from 320 to 304 when computing TENSOR_ACTIVE, since only 38 CUs are used per XCD in MI300X.)

MfmaUtil expr=100*SQ_VALU_MFMA_BUSY_CYCLES/(GRBM_GUI_ACTIVE*CU_NUM*4)
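To make the arithmetic explicit, here is a small sanity check of the numbers above:

# 8192^3 float16 GEMM: measured 603 TFLOPS vs. 1307.4 TFLOPS peak
M = N = K = 8192
flops = 2 * M * N * K            # ~1.1e12 FLOPs per matmul
print(flops / 603e12)            # implied kernel latency, ~1.8 ms
print(603e12 / 1307.4e12)        # ~0.46 -> the 46% "measured utilization"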
Would it be that (1) CU_NUM has to be 304 (not 320) and (2) the factor of 4 should be dropped, like

MfmaUtil expr=100*SQ_VALU_MFMA_BUSY_CYCLES/(GRBM_GUI_ACTIVE*304)

? Then it's 44%, which is close to the 46% from the measured latency.

Operating System
5.15.0-116-generic #126-Ubuntu SMP
CPU
AMD EPYC 9654 96-Core Processor
GPU
AMD Instinct MI300X
ROCm Version
ROCm 6.2.1
ROCm Component
rocprofiler
Steps to Reproduce
Run the following torch code with rocprofiler:
test.py
import torch

dtype = torch.float16                  # the issue measures float16 GEMMs
M, N, K = 8192, 8192, 8192
a = torch.randn((K, N), device='cuda', dtype=dtype)
b = torch.randn((M, K), device='cuda', dtype=dtype)
torch.matmul(a, b)                     # all dims are 8192, so the shapes line up
input.txt
pmc: SQ_WAVES GRBM_COUNT GRBM_GUI_ACTIVE SQ_INSTS_VALU SQ_INSTS_VALU_MFMA_F16 SQ_INSTS_MFMA SQ_VALU_MFMA_BUSY_CYCLES GRBM_GUI_ACTIVE CU_NUM SQ_INSTS_VALU_MFMA_F32 SQ_BUSY_CU_CYCLES
rocprofv3 --kernel-trace -i input.txt -o counter1.txt -- python test.py

From the CSV files produced by rocprof, compute TFLOPS from kernel.csv and compute TENSOR_ACTIVE from counter_collection.csv.
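A minimal sketch of that post-processing step, assuming pandas and illustrative column names (the actual CSV headers and file names depend on the rocprofv3 version, so adjust them to what your kernel.csv and counter_collection.csv contain):

import pandas as pd

PEAK_FLOPS_FP16 = 1307.4e12
M = N = K = 8192
CU_NUM = 304

kern = pd.read_csv("kernel.csv")                 # file names per the description above
ctr = pd.read_csv("counter_collection.csv")

# Measured utilization from the kernel latency (column names assumed)
latency_s = (kern["End_Timestamp"] - kern["Start_Timestamp"]).min() * 1e-9
tflops = 2 * M * N * K / latency_s / 1e12
print("achieved TFLOPS:", tflops, "-> utilization:", tflops * 1e12 / PEAK_FLOPS_FP16)

# TENSOR_ACTIVE / MfmaUtil from the counters (column names assumed, CU_NUM = 304)
busy = ctr.loc[ctr["Counter_Name"] == "SQ_VALU_MFMA_BUSY_CYCLES", "Counter_Value"].sum()
active = ctr.loc[ctr["Counter_Name"] == "GRBM_GUI_ACTIVE", "Counter_Value"].max()
print("MfmaUtil (%):", 100 * busy / (active * CU_NUM * 4))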
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
No response
Additional Information
No response