Your question
I'm trying to train GPT/LLaMA models on top of Megatron-LM, but I'm confused about fp8 performance.
Setting the fp8 format parameters together with "--bf16" performs much better than the same fp8 settings without "--bf16". What is the difference between these two configurations inside Megatron-LM?
When fp8 and bf16 are set together, does Megatron-LM split some computation off to bf16 where that is more efficient, and keep the rest in fp8 for higher throughput?
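For reference, this is roughly the comparison I'm running. It is only a minimal sketch: the exact fp8 argument names (--fp8-format, --fp8-amax-history-len, --fp8-amax-compute-algo) are taken from the Megatron-LM version I'm using and may differ in other releases, and the model/parallelism arguments are omitted.

```bash
# Case 1: fp8 flags together with --bf16 (much better throughput)
torchrun pretrain_gpt.py \
    --bf16 \
    --fp8-format hybrid \
    --fp8-amax-history-len 1024 \
    --fp8-amax-compute-algo max \
    # ... model, data, and parallelism arguments omitted

# Case 2: the same fp8 flags, but without --bf16 (noticeably slower)
torchrun pretrain_gpt.py \
    --fp8-format hybrid \
    --fp8-amax-history-len 1024 \
    --fp8-amax-compute-algo max \
    # ... model, data, and parallelism arguments omitted
```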