🚀 Motivation and context
Performance metrics like TFLOPS (10^12 floating-point operations per second) and memory bandwidth utilization (GB per second) are crucial for optimizing matrix multiplication operators and understanding how well those operators utilize the GPU hardware. These metrics are not immediately available from the trace, but they can be derived using the operator input dimensions, kernel execution times, etc. Thus, we request that these metrics be added to HTA.
Description
FLOPS calculation
Assuming a matrix multiplication $A_{M \times K} \times B_{K \times N}$ takes $t$ seconds to finish, we can compute the TFLOPS by $TFLOPS = 2 \times 10^{-12} \times (K - 1) \times M \times N / t$.
Here, $M$, $K$, and $N$ can be extracted from the "input_dim" column; $t$ is the total time the operator's GPU kernels spend executing on the GPU.
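As an illustration, here is a minimal sketch of the calculation (not HTA's actual API; the function name and the example numbers below are hypothetical), assuming $M$, $K$, $N$, and the kernel time have already been extracted from the trace:

```python
def matmul_tflops(m: int, k: int, n: int, t_seconds: float) -> float:
    """TFLOPS for an (M x K) @ (K x N) matmul whose kernels ran for t_seconds.

    Uses the operation count 2 * (K - 1) * M * N from the formula above.
    """
    flops = 2 * (k - 1) * m * n
    return flops / t_seconds / 1e12  # scale ops/sec down to TFLOPS


# Hypothetical example: a 4096 x 4096 x 4096 matmul finishing in 1.2 ms
print(matmul_tflops(4096, 4096, 4096, 1.2e-3))  # ~114.5 TFLOPS
```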
Alternatives
No response
Additional context
No response