🚀 Efficiently (pre)training foundation models with native PyTorch features, including FSDP for training and SDPA implementation of Flash attention v2.
The readme lists the throughput for the 7B model on the H100s as 9600 toks / sec. The 7B model performs ~0.05 TFLOP per token (roughly 6 FLOPs per parameter per token for the forward and backward passes), so the FLOPs per sec is 9600 * 0.05 = 480 TFLOPs / sec. When I look up the max TFLOPs per sec of the H100 SXM using bfloat16 here, it gives 1979 TFLOPs / sec. This gives an MFU of 480 / 1979 = 24.3%, which differs from the 46% MFU given in the readme. Curious what's causing the delta!
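For reference, here's the arithmetic spelled out as a quick sketch (the tokens/sec, FLOPs-per-token, and peak-TFLOPs figures are the ones quoted above, not values pulled from the repo):

```python
# Back-of-the-envelope MFU check using the numbers above.
tokens_per_sec = 9600      # 7B throughput from the README
tflops_per_token = 0.05    # ~6 FLOPs/param/token (fwd + bwd) for a 7B model, rounded up
peak_tflops = 1979         # H100 SXM bf16 peak from NVIDIA's datasheet

achieved_tflops = tokens_per_sec * tflops_per_token  # = 480 TFLOPs / sec
mfu = achieved_tflops / peak_tflops                  # = 0.2425
print(f"{achieved_tflops:.0f} TFLOPs/sec, MFU = {mfu:.1%}")  # 480 TFLOPs/sec, MFU = 24.3%
```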
P.S. This project is a great resource - really appreciate all the work on it!
Hi @jasonkrone , the 1979 TFLOPs / sec you found is the "with sparsity" figure, which is double the actual bf16 tensor core throughput. The number to use would be 989 TFLOPs / sec.
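Redoing the calculation with the dense figure (a quick sketch reusing the 9600 toks/sec and ~0.05 TFLOP/token estimates from the question above):

```python
# Same arithmetic as above, but with the dense (non-sparse) bf16 peak.
achieved_tflops = 9600 * 0.05   # 480 TFLOPs / sec, as in the question
dense_peak_tflops = 989         # H100 SXM bf16 peak without sparsity
mfu = achieved_tflops / dense_peak_tflops
print(f"MFU = {mfu:.1%}")       # ~48.5%, in the same ballpark as the README's 46%
# The small remaining gap is presumably rounding in the ~0.05 TFLOP/token estimate.
```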