Your question
I'm trying to train GPT/LLaMA models on top of Megatron-LM, but I'm confused about fp8 performance.
Setting the fp8 format parameters together with "--bf16" performs much better than the same fp8 settings without "--bf16". What is the difference between these two configurations inside Megatron-LM?
When fp8 and bf16 are set together, does Megatron-LM split some computation off to bf16 where that is more efficient, and keep the rest in fp8 for higher throughput?
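For reference, this is roughly the comparison I'm running. It is only a minimal sketch: the exact fp8 argument names (--fp8-format, --fp8-amax-history-len, --fp8-amax-compute-algo) are taken from the Megatron-LM version I'm using and may differ in other releases, and the model/parallelism arguments are omitted.

```bash
# Case 1: fp8 flags together with --bf16 (much better throughput)
torchrun pretrain_gpt.py \
    --bf16 \
    --fp8-format hybrid \
    --fp8-amax-history-len 1024 \
    --fp8-amax-compute-algo max \
    # ... model, data, and parallelism arguments omitted

# Case 2: the same fp8 flags, but without --bf16 (noticeably slower)
torchrun pretrain_gpt.py \
    --fp8-format hybrid \
    --fp8-amax-history-len 1024 \
    --fp8-amax-compute-algo max \
    # ... model, data, and parallelism arguments omitted
```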