NVIDIA / Megatron-LM

Ongoing research training transformer models at scale
https://docs.nvidia.com/megatron-core/developer-guide/latest/user-guide/index.html#quick-start

fp8 transformer engine only brings 35% speed up? #396

Closed: FeixLiu closed this issue 1 year ago

FeixLiu commented 1 year ago

Hi there,

I've used Megatron to train a 13B GPT model on an H100 machine. Before enabling the fp8 transformer engine, training speed was about 0.34 s/step. After enabling it with the two arguments --fp8-hybrid and --transformer-impl "transformer_engine", the speed improved to about 0.24 s/step. From this blog, fp8 should give a 100% speedup compared with bf16, but I only got about a 35% speedup in Megatron. Is a 35% speedup reasonable, or have I made some mistake in using the fp8 transformer engine?

Thanks a lot for the reply.

lmcafee-nvidia commented 1 year ago

I assume you are referencing Figure 9 from the white paper linked from that blog? If so, that figure is simply stating that fp8 is computationally 2x the throughput of bf16, when isolating arithmetic operations. The actual end-to-end speedup will be less than this, since you must account for other overheads like communication, memory bandwidth, and the optimizer step. The speedup will also vary greatly depending on your model size and micro batch size.
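
To make that concrete, here is a rough back-of-the-envelope sketch (my own illustration, not from the white paper): if only a fraction of each step is spent in GEMMs that fp8 accelerates by roughly 2x, Amdahl's law caps the end-to-end gain well below 2x. The `end_to_end_speedup` helper and the GEMM fractions below are assumptions for illustration; only the 0.34 s and 0.24 s step times come from this issue.

    # Amdahl's-law sketch (illustrative): fp8 roughly doubles GEMM throughput,
    # but communication, memory-bound ops, and the optimizer step are unchanged,
    # so the end-to-end speedup is smaller than 2x.

    def end_to_end_speedup(gemm_fraction: float, gemm_speedup: float = 2.0) -> float:
        """Expected step-time speedup when only the GEMM fraction is accelerated."""
        return 1.0 / ((1.0 - gemm_fraction) + gemm_fraction / gemm_speedup)

    # Observed in this issue: 0.34 s/step (bf16) -> 0.24 s/step (fp8), i.e. ~1.4x.
    print(f"observed speedup: {0.34 / 0.24:.2f}x")

    # A ~1.4x end-to-end speedup is consistent with roughly 60% of the step time
    # being in fp8-accelerated GEMMs; the exact fraction depends on model size,
    # micro batch size, and parallelism configuration.
    for f in (0.4, 0.6, 0.8, 1.0):
        print(f"GEMM fraction {f:.0%}: expected speedup {end_to_end_speedup(f):.2f}x")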

FeixLiu commented 1 year ago

Got it, thanks for the reply!

exnx commented 4 months ago

Should it be possible to use fp8 with pipeline parallelism? My training hangs when I try to use both. I can use fp8 with model parallelism fine, though.

yanchenmochen commented 3 weeks ago

When I launch training on an H100, the --bf16 parameter works fine. But when I add the fp8 parameters alongside it, an OOM error occurs, which confuses me a lot. The added parameters are:

    --bf16 \
    --fp8-format hybrid \
    --fp8-amax-compute-algo max \
    --fp8-amax-history-len 16 \
    --transformer-impl transformer_engine
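
For reference, my understanding (an assumption based on the flag names, not the exact Megatron internals) is that these flags configure a Transformer Engine delayed-scaling recipe along the lines of the sketch below; --bf16 stays enabled because everything outside the TE fp8 GEMMs still runs in bf16. The layer and tensor sizes are placeholders.

    # Hedged sketch of how the fp8 flags appear to map onto Transformer Engine's
    # recipe API (requires transformer_engine and an fp8-capable GPU such as H100).
    import torch
    import transformer_engine.pytorch as te
    from transformer_engine.common import recipe

    # --fp8-format hybrid          -> Format.HYBRID (E4M3 forward, E5M2 backward)
    # --fp8-amax-compute-algo max  -> amax_compute_algo="max"
    # --fp8-amax-history-len 16    -> amax_history_len=16
    fp8_recipe = recipe.DelayedScaling(
        fp8_format=recipe.Format.HYBRID,
        amax_history_len=16,
        amax_compute_algo="max",
    )

    # Placeholder layer: parameters stay in bf16 (--bf16); only the GEMMs inside
    # the TE module run in fp8 under the autocast context.
    layer = te.Linear(4096, 4096, bias=True, params_dtype=torch.bfloat16).cuda()
    x = torch.randn(16, 4096, dtype=torch.bfloat16, device="cuda")

    with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
        y = layer(x)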