NVIDIA / Megatron-LM

Ongoing research training transformer models at scale
https://docs.nvidia.com/megatron-core/developer-guide/latest/user-guide/index.html#quick-start

Speed comparison between tensor parallel and pipeline parallel #258

Open kisseternity opened 1 year ago

kisseternity commented 1 year ago

Hello, I have compared the training speed between tensor parallel and pipeline parallel in Megatron with a DGX A100 node. I find that when the micro-batch-size and gradient accumulation steps are big enough, pure pipeline parallism runs faster than pure tensor parallelism, with over 30% speed up. But in the paper Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM, the conclusion is that inside a server node, it's best practice to use only tensor parallel to achieve highest TFLOPS. image

What I'm curious about is whether higher TFLOPS is the same thing as faster training. Since tensor parallelism spends part of each step on all_reduce and other communication, it naturally seems to have more work per iteration than pipeline parallelism. Here is my test environment: a DGX A100 node with 8 GPUs, running 10-billion, 8-billion, and 1-billion parameter BERT models in Megatron (obtained by adjusting the hyper-parameters). The comparison is between tensor-parallel size 8 and pipeline-parallel size 8. Any suggestions or ideas would be helpful, thanks.
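
To make the question concrete, here is a rough back-of-the-envelope sketch of how reported TFLOPS relates to iteration time. The FLOPs expression is the transformer estimate from the Megatron-LM paper (assuming no activation recomputation), and the model config and iteration times are made-up placeholders, not my measurements:

```python
# Sketch: the "model TFLOPS per GPU" number is just model FLOPs per iteration
# divided by wall-clock iteration time and GPU count, so for a fixed model and
# global batch size it moves in lockstep with training speed.
# Assumptions: FLOPs formula per the Megatron-LM paper (no activation
# recomputation); the config and iteration times below are placeholders.

NUM_GPUS = 8  # one DGX A100 node

def flops_per_iteration(batch: int, seq: int, layers: int, hidden: int, vocab: int) -> float:
    """Forward + backward FLOPs for one iteration of a transformer model."""
    return 72 * batch * seq * layers * hidden**2 * (
        1 + seq / (6 * hidden) + vocab / (12 * layers * hidden)
    )

def achieved_tflops_per_gpu(iter_time_s: float, **model) -> float:
    """Model FLOPs per iteration divided by wall-clock time and GPU count."""
    return flops_per_iteration(**model) / (iter_time_s * NUM_GPUS * 1e12)

# Placeholder ~10B-parameter BERT-like config and made-up iteration times.
model = dict(batch=1024, seq=512, layers=48, hidden=4096, vocab=30522)
for name, iter_time_s in [("TP=8, PP=1", 30.0), ("TP=1, PP=8", 22.0)]:
    print(f"{name}: {achieved_tflops_per_gpu(iter_time_s, **model):.0f} TFLOP/s per GPU")
```

Under this accounting the communication time shows up only in the denominator, so for the same model and global batch size, whichever configuration finishes an iteration sooner also reports the higher achieved TFLOPS.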

github-actions[bot] commented 1 year ago

Marking as stale. No activity in 60 days. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] commented 10 months ago

Marking as stale. No activity in 60 days.