🚀 Motivation and context
Performance metrics like TFLOPS (10^12 floating-point operations per second) and memory bandwidth utilization (GB per second) are crucial for optimizing matrix multiplication operators and understanding how well those operators utilize the GPU hardware. These metrics are not immediately available from the trace, but they can be derived using the operator input dimensions, kernel execution times, etc. Thus, we request that these metrics be added to HTA.
Description
FLOPS calculation
Assuming a matrix multiplication $A_{M \times K} \times B_{K \times N}$ takes $t$ seconds to finish, we can compute the TFLOPS by $TFLOPS = 2 \times 10^{-12} \times (K - 1) \times M \times N / t$.
Here, $M$, $K$, and $N$ can be extracted from the "input_dim" column; $t$ is the total time the operator's GPU kernels spend executing on the GPU.
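As an illustration, here is a minimal sketch of the calculation (not HTA's actual API; the function name and the example numbers below are hypothetical), assuming $M$, $K$, $N$, and the kernel time have already been extracted from the trace:

```python
def matmul_tflops(m: int, k: int, n: int, t_seconds: float) -> float:
    """TFLOPS for an (M x K) @ (K x N) matmul whose kernels ran for t_seconds.

    Uses the operation count 2 * (K - 1) * M * N from the formula above.
    """
    flops = 2 * (k - 1) * m * n
    return flops / t_seconds / 1e12  # scale ops/sec down to TFLOPS


# Hypothetical example: a 4096 x 4096 x 4096 matmul finishing in 1.2 ms
print(matmul_tflops(4096, 4096, 4096, 1.2e-3))  # ~114.5 TFLOPS
```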
Alternatives
No response
Additional context
No response