[QST] Why is 3xTF32 faster than 1xTF32?

TUMSchieben commented 1 year ago

What is your question? I've learned example 27_ampere_3xtf32_fast_accurate_tensorop_gemm, which says the following:

1xTF32: FP32 in, converted to one TF32 internally, accumulated in FP32, FP32 out 3xTF32: FP32 in, converted in TF32-big and TF32-small internally, accumulated in FP32, FP32 out

From my understanding, 1xTF32 has 1 TF32 mad operation while 3xTF32 has 3 TF32 mad operations. And both modes do FP32 accumulation. Is this right? Then why is 3xTF32 faster than 1xTF32?

hwu36 commented 1 year ago

It is slower than 1xtf32 tensor cores, but faster than fp32 simt.

TUMSchieben commented 1 year ago

It is slower than 1xtf32 tensor cores, but faster than fp32 simt.

But from my experiment, performance of 3xTF32 is much better...

mnicely commented 1 year ago

@TUMSchieben I would argue TF32 doesn't do FP32 accumulation. TF32 Tensor Cores take FP32 input, performance accumulation at lower precision, and returns a FP32 output. It is in no way meant to be a replacement for FP32, it's meant for workloads that don't need FP32 precision but need more than FP16 range.

With 3xTF32, we are emulating a non-IEEE complaint FP32 result. You want to use this when you need more precision that FP16, but don't need an IEEE complaint FP32 result (e.g., Deep Learning). 3xTF32 is not perfect, as accuracy is based on the size of K. If K is small, less than 256, the accuracy is less that FP32. As K increases, the accuracy become much better than FP32 relative to FP64.

TF32 has a peak throughput of 156 TOPs. With 3xTF32 (three TF32 operations), theoretical peak is 156/3 -- 52TOPs. Check out slide 25 from my GTC Spring '22 talk and you'll see that we get 48TOPs at best.

Therefore, if you are seeing better performance with 3xTF32 over TF32, something is wrong.

TUMSchieben commented 1 year ago

@TUMSchieben I would argue TF32 doesn't do FP32 accumulation. TF32 Tensor Cores take FP32 input, performance accumulation at lower precision, and returns a FP32 output. It is in no way meant to be a replacement for FP32, it's meant for workloads that don't need FP32 precision but need more than FP16 range.

With 3xTF32, we are emulating a non-IEEE complaint FP32 result. You want to use this when you need more precision that FP16, but don't need an IEEE complaint FP32 result (e.g., Deep Learning). 3xTF32 is not perfect, as accuracy is based on the size of K. If K is small, less than 256, the accuracy is less that FP32. As K increases, the accuracy become much better than FP32 relative to FP64.

TF32 has a peak throughput of 156 TOPs. With 3xTF32 (three TF32 operations), theoretical peak is 156/3 -- 52TOPs. Check out slide 25 from my GTC Spring '22 talk and you'll see that we get 48TOPs at best.

Therefore, if you are seeing better performance with 3xTF32 over TF32, something is wrong.

Thx for the explanation. The issue is resolved as I misplaced the math op "OpMultiplyAddFastF32"/"OpMultiplyAdd" setting.

NVIDIA / cutlass

[QST] Why is 3xTF32 faster than 1xTF32? #729