Dao-AILab / flash-attention

Fast and memory-efficient exact attention

Question regarding overlapping #1160

Open wookjeHan opened 2 months ago

wookjeHan commented 2 months ago

First of all, thank you for sharing your excellent work!

I have a question about overlapping (pingpong design). From my understanding:

1) With FP8 precision and a head dimension of 128, the exponential function seems to take about the same amount of time as GEMM0 + GEMM1. This is because matmul throughput is roughly 512 times that of the exponential (MUFU) unit, while GEMM0 + GEMM1 also issue roughly 512 times more FLOPs than the exponential operation (a quick arithmetic check is sketched right after this list).

2) The softmax involves two MUFU operations per element: one for the exponential and another for the floating-point division. That would make softmax about twice as slow as GEMM0 + GEMM1.
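
For concreteness, here is a back-of-the-envelope check of the balance claimed in 1). The peak-rate numbers below are my assumptions for an H100 (roughly 1979 TFLOPS of FP8 matmul vs. about 3.9 TFLOPS of MUFU exponentials), not figures taken from this repo:

```python
head_dim = 128

# FLOPs per attention-score element:
gemm_flops = 2 * head_dim + 2 * head_dim      # GEMM0 (Q @ K^T) + GEMM1 (P @ V) -> 512
exp_ops = 1                                   # one exponential per element

# Assumed peak rates (H100 SXM, order of magnitude; my numbers, not from this thread):
matmul_tflops = 1979                          # FP8 tensor-core matmul
mufu_tflops = 3.9                             # MUFU exponential throughput

gemm_time = gemm_flops / (matmul_tflops * 1e12)
exp_time = exp_ops / (mufu_tflops * 1e12)

print(f"GEMM FLOPs per element: {gemm_flops}")                                # 512
print(f"matmul / MUFU throughput ratio: {matmul_tflops / mufu_tflops:.0f}x")  # ~507x
print(f"exp time / GEMM time per element: {exp_time / gemm_time:.2f}")        # ~0.99
```

Under those assumptions the exponential and the two GEMMs land at roughly the same per-element cost, which is the premise behind the question.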

However, the figure provided shows softmax taking only half as much time as GEMM0 + GEMM1. If softmax actually takes twice as long as the GEMMs (GEMM0 + GEMM1), the GEMM units would be idle for half of the time.

[figure: illustration of GEMM0/GEMM1 and softmax overlapping]

So my questions are:

1) Is my understanding of the time consumption of GEMM and softmax correct? Specifically, for FP8 precision and a head dimension of 128, is the kernel really bound by MUFU throughput?

2) If so, does this mean the GEMM units are executing only half of the time and remain idle for the other half?

Thank you!

tridao commented 2 months ago

The figure is not drawn to scale; it's just an illustration.

The way we do it, softmax only has 1 MUFU (exponential). There's no floating point division. Division is done at the very end.
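
For readers following the thread, here is a minimal NumPy sketch of that idea (my illustration of the general online-softmax approach, not the actual FlashAttention-3 CUDA kernel): only the exponential is used inside the main loop, the softmax denominator is accumulated as a running row sum, and each output row is divided by it once at the very end.

```python
import numpy as np

def attention_deferred_division(q, k, v, block_size=64):
    """Sketch of online softmax with the division deferred to the end.

    Inside the loop only exponentials (the single MUFU op) and GEMMs run;
    the divide by the softmax denominator happens once per row at the end.
    Illustration only -- not the FlashAttention-3 kernel.
    """
    seqlen_q, head_dim = q.shape
    scale = 1.0 / np.sqrt(head_dim)

    out = np.zeros((seqlen_q, v.shape[1]))      # unnormalized output accumulator
    row_max = np.full(seqlen_q, -np.inf)        # running max (numerical stability)
    row_sum = np.zeros(seqlen_q)                # running softmax denominator

    for start in range(0, k.shape[0], block_size):
        kb = k[start:start + block_size]
        vb = v[start:start + block_size]

        s = (q @ kb.T) * scale                  # GEMM0 for this K block
        new_max = np.maximum(row_max, s.max(axis=1))
        p = np.exp(s - new_max[:, None])        # exponential (MUFU)
        correction = np.exp(row_max - new_max)  # rescale previously accumulated partials

        out = out * correction[:, None] + p @ vb    # GEMM1, accumulated unnormalized
        row_sum = row_sum * correction + p.sum(axis=1)
        row_max = new_max

    return out / row_sum[:, None]               # single division, at the very end
```

Comparing the output against a reference `softmax((q @ k.T) * scale) @ v` recovers the same result up to floating-point error.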