FeixLiu closed this issue 1 year ago.
I assume you are referencing Figure 9 from the white paper linked from that blog? If so, that figure simply states that fp8 has 2x the computational throughput of bf16 when isolating the arithmetic operations. The actual end-to-end speedup will be smaller than this, since you must account for other overheads like communication, memory bandwidth, and the optimizer step. The speedup will also vary greatly depending on your model size and micro batch size.
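As a rough illustration of why the end-to-end number stays well below 2x, here is a back-of-the-envelope estimate (the GEMM-time fractions below are assumed values for illustration, not measurements):

```python
# Back-of-the-envelope estimate (Amdahl's-law style) of end-to-end fp8 speedup.
# Assumption: only the GEMM portion of a step speeds up, by roughly 2x; the
# gemm_fraction values are illustrative, not measured.

def end_to_end_speedup(gemm_fraction: float, gemm_speedup: float = 2.0) -> float:
    """Whole-step speedup when only a fraction of the step time gets faster."""
    return 1.0 / ((1.0 - gemm_fraction) + gemm_fraction / gemm_speedup)

print(end_to_end_speedup(0.6))  # ~1.43x if 60% of step time is fp8-eligible compute
print(end_to_end_speedup(0.8))  # ~1.67x if 80% is

# The step times reported in this thread, 0.34 s -> 0.24 s, correspond to a
# measured speedup of roughly 1.4x:
print(0.34 / 0.24)  # ~1.42
```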
Got it, thanks for the reply!
Should it be possible to use fp8 with pipeline parallelism? My training hangs when I try to use both together, but fp8 with model parallelism works fine.
When I launch training on an H100, the --bf16 parameter works fine. But when I add the fp8 parameters to the same run, I get an OOM error, which confuses me a lot. The added parameters are:
--bf16 \
--fp8-format hybrid \
--fp8-amax-compute-algo max \
--fp8-amax-history-len 16 \
--transformer-impl transformer_engine
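Those flags correspond to Transformer Engine's delayed-scaling fp8 recipe. A minimal standalone sketch of the equivalent TE-level setup (illustrative only, not the exact code path Megatron-LM takes):

```python
# Minimal standalone Transformer Engine sketch of what the flags above roughly
# map to: hybrid fp8 format with delayed scaling over a 16-step amax history.
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

# --fp8-format hybrid         -> Format.HYBRID (E4M3 forward, E5M2 for gradients)
# --fp8-amax-history-len 16   -> amax_history_len=16
# --fp8-amax-compute-algo max -> amax_compute_algo="max"
fp8_recipe = DelayedScaling(
    fp8_format=Format.HYBRID,
    amax_history_len=16,
    amax_compute_algo="max",
)

layer = te.Linear(4096, 4096, bias=True).cuda()
x = torch.randn(32, 4096, device="cuda", requires_grad=True)

# Forward/backward of this layer run its GEMMs in fp8 under the autocast context.
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)
y.sum().backward()
```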
Hi there,
I've used Megatron to train a 13B GPT model on an H100 machine. Before enabling the fp8 Transformer Engine path, the training speed was about 0.34 s/step. After I enabled fp8 with these two arguments
--fp8-hybrid, --transformer-impl "transformer_engine"
, the training speed was about 0.24 s/step. According to this blog, fp8 should give a 100% speedup compared with bf16, but I only got about a 35% speedup in Megatron. Is a 35% speedup reasonable, or have I made some mistake in using the fp8 Transformer Engine? Thanks a lot for the reply.