NVIDIA-developer-blog / code-samples

Source code examples from the Parallel Forall Blog
BSD 3-Clause "New" or "Revised" License

How to calculate TFLOPS in LSTM.cu #7

Closed yhuanghamu closed 8 years ago

yhuanghamu commented 8 years ago

The output of this code is runtime, but what I want to compare is throughput. How do I convert the runtime into TFLOPS? I mean, how is the computation related to the other parameters?

JAppleyard commented 8 years ago

The vast majority of FLOPs in an LSTM are in the matrix multiplications. A single matrix multiplication requires 2MN(K+1) FLOPs. There are 8 matrix multiplications per layer per timestep, and in this case M=K=hiddenSize, N=minibatch. Therefore the total FLOPs are:

layers * timesteps * 8 * 2 * hiddenSize * minibatch * (hiddenSize + 1).

In reality there are a few more FLOPs due to biases and activation functions. These are only significant if hiddenSize is very small, as they scale linearly with hiddenSize rather than with its square.

In any case, this will give you the total approximate number of FLOPs. Divide by time and multiply by 10^-12, and you have TFLOPS.

xiezhq-hermann commented 5 years ago

@JAppleyard Hi, why does a single matrix multiplication require 2MN(K+1) FLOPs? I mean, it should be MN(2K-1), right?