🚀 Efficiently (pre)training foundation models with native PyTorch features, including FSDP for training and SDPA implementation of Flash attention v2.
The readme lists the throughput for the 7B model on the H100s as 9600 toks / sec. The 7B model performs ~0.05 TFLOP per token (roughly 6 FLOPs per parameter per token for the forward and backward passes), so the FLOPs per sec is 9600 * 0.05 = 480 TFLOPs / sec. When I look up the max TFLOPs per sec of the H100 SXM using bfloat16 here, it gives 1979 TFLOPs / sec. This gives an MFU of 480 / 1979 = 24.3%, which differs from the 46% MFU given in the readme. Curious what's causing the delta!
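For reference, here's the arithmetic spelled out as a quick sketch (the tokens/sec, FLOPs-per-token, and peak-TFLOPs figures are the ones quoted above, not values pulled from the repo):

```python
# Back-of-the-envelope MFU check using the numbers above.
tokens_per_sec = 9600      # 7B throughput from the README
tflops_per_token = 0.05    # ~6 FLOPs/param/token (fwd + bwd) for a 7B model, rounded up
peak_tflops = 1979         # H100 SXM bf16 peak from NVIDIA's datasheet

achieved_tflops = tokens_per_sec * tflops_per_token  # = 480 TFLOPs / sec
mfu = achieved_tflops / peak_tflops                  # = 0.2425
print(f"{achieved_tflops:.0f} TFLOPs/sec, MFU = {mfu:.1%}")  # 480 TFLOPs/sec, MFU = 24.3%
```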
P.S. This project is a great resource - really appreciate all the work on it!
Hi @jasonkrone , the 1979 TFLOPs / sec you found is the "with sparsity" figure, which is double the actual bf16 tensor core throughput. The number to use would be 989 TFLOPs / sec.
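Redoing the calculation with the dense figure (a quick sketch reusing the 9600 toks/sec and ~0.05 TFLOP/token estimates from the question above):

```python
# Same arithmetic as above, but with the dense (non-sparse) bf16 peak.
achieved_tflops = 9600 * 0.05   # 480 TFLOPs / sec, as in the question
dense_peak_tflops = 989         # H100 SXM bf16 peak without sparsity
mfu = achieved_tflops / dense_peak_tflops
print(f"MFU = {mfu:.1%}")       # ~48.5%, in the same ballpark as the README's 46%
# The small remaining gap is presumably rounding in the ~0.05 TFLOP/token estimate.
```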