NVIDIA-developer-blog / code-samples

Source code examples from the Parallel Forall Blog
BSD 3-Clause "New" or "Revised" License

How to calculate TFLOPS in LSTM.cu #7

Closed yhuanghamu closed 8 years ago

yhuanghamu commented 8 years ago

The output of this code is runtime, but what I want to compare is throughput. How do I convert the runtime into TFLOPS? I mean, how is the computation related to the other parameters?

JAppleyard commented 8 years ago

The vast majority of FLOPs in an LSTM are in the matrix multiplications. A single matrix multiplication requires 2MN(K+1) FLOPs. There are 8 matrix multiplications per layer per timestep, and in this case M=K=hiddenSize, N=minibatch. Therefore the total FLOPs are:

layers * timesteps * 8 * 2 * hiddenSize * minibatch * (hiddenSize + 1).

In reality there are a few more FLOPs due to biases and activation functions. These are only significant if hiddenSize is very small, as they scale linearly with hiddenSize rather than with its square.

In any case, this will give you the total approximate number of FLOPs. Divide by time and multiply by 10^-12, and you have TFLOPS.

xiezhq-hermann commented 5 years ago

@JAppleyard Hi, why does a single matrix multiplication require 2MN(K+1) FLOPs? I mean, it should be MN(2K-1), right?