NVIDIA / Megatron-LM

Ongoing research training transformer models at scale
https://docs.nvidia.com/megatron-core/developer-guide/latest/user-guide/index.html#quick-start

[BUG] Modify FLOPs in MFU calculation for causal mask when using FlashAttention. #831

Open Yuxin-CV opened 4 months ago

Yuxin-CV commented 4 months ago

Hi, I suggest we modify the FLOPs calculation used for MFU to match the FlashAttention benchmark script.
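
For reference, here is a minimal sketch of the FLOPs-counting convention that benchmark uses; the function name and the 2.5x/3.5x backward factors are my reconstruction from memory, not a verbatim copy of the script:

```python
# Sketch of the FLOPs convention in FlashAttention's benchmark script;
# names and backward-pass factors are assumptions, check the script itself.
def attn_flops(batch, seqlen, headdim, nheads, causal, mode="fwd"):
    assert mode in ["fwd", "bwd", "fwd_bwd"]
    # 4 * seqlen^2 * nheads * headdim: two matmuls (Q @ K^T and P @ V),
    # 2 FLOPs per multiply-accumulate. A causal mask skips half the work.
    f = 4 * batch * seqlen**2 * nheads * headdim // (2 if causal else 1)
    # Backward does roughly 2x the forward matmuls plus a recompute pass.
    return f if mode == "fwd" else (2.5 * f if mode == "bwd" else 3.5 * f)
```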

Specifically, the current calculation for the causal mask can report more than 100% MFU at seq_len = 16k (189 * 2 / 312 = 1.21), which is impossible. When using FlashAttention, the attention FLOPs for the causal-mask setting should be divided by 2.
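
To make the arithmetic concrete, here is a small sketch using the numbers above (312 TFLOP/s is the A100 dense BF16/FP16 peak; 189 TFLOP/s is the causal FlashAttention-2 throughput at seq_len = 16k as reported under the halved FLOP count):

```python
PEAK_TFLOPS = 312.0      # A100 dense BF16/FP16 peak
measured_tflops = 189.0  # causal FA2 at seq_len = 16k, halved FLOP count

# Counting the full seq_len^2 attention FLOPs credits the kernel with 2x
# the work it actually does under a causal mask:
mfu_full = measured_tflops * 2 / PEAK_TFLOPS  # 1.21 -> 121% "MFU", impossible

# Halving the attention FLOPs for the causal mask gives a sane figure:
mfu_causal = measured_tflops / PEAK_TFLOPS    # ~0.61 -> 61% MFU
print(f"full count: {mfu_full:.2f}, causal count: {mfu_causal:.2f}")
```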

[Figure: flash2_a100_fwd_bwd_benchmark — FlashAttention-2 forward+backward throughput benchmark on A100]
github-actions[bot] commented 2 months ago

Marking as stale. No activity in 60 days.