Hello, I have compared the training speed of tensor parallelism versus pipeline parallelism in Megatron on a DGX A100 node.
I find that when the micro-batch size and gradient-accumulation steps are large enough, pure pipeline parallelism runs faster than
pure tensor parallelism, with over a 30% speedup. But in the paper "Efficient Large-Scale Language Model Training on GPU Clusters
Using Megatron-LM", the conclusion is that within a server node, the best practice is to use only tensor parallelism to achieve the highest
TFLOPS.
What I'm curious about is: does higher TFLOPS equal faster training speed? Since tensor parallelism spends time on all_reduce and other
communication steps, it naturally seems to have more overhead per step than pipeline parallelism.
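To make the question concrete, here is how I understand the relationship between reported TFLOPS and wall-clock speed (a sketch of the usual accounting, not necessarily Megatron's exact formula; the function name and the example numbers are mine):

```python
def achieved_tflops_per_gpu(flops_per_iter: float, iter_time_s: float, n_gpus: int) -> float:
    """Achieved TFLOPS per GPU: useful model FLOPs for one iteration
    divided by (iteration wall-clock time * number of GPUs).

    Communication (all_reduce, point-to-point) adds to iter_time_s but
    contributes no useful FLOPs, so it lowers this number."""
    return flops_per_iter / (iter_time_s * n_gpus) / 1e12

# Hypothetical example: 1e15 model FLOPs per iteration, 2 s/iter, 8 GPUs.
print(achieved_tflops_per_gpu(1e15, 2.0, 8))  # -> 62.5
```

If the FLOPs in the numerator are fixed by the model and batch size, then for the same model, higher achieved TFLOPS should mean faster wall-clock training, which is why the paper's conclusion and my measurements seem to conflict.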
Here is my test environment:
- DGX A100 node with 8 GPUs.
- BERT models with 10 billion, 8 billion, and 1 billion parameters, configured in Megatron by adjusting the hyper-parameters.
- The comparison is between tensor-parallel size 8 and pipeline-parallel size 8.

Any suggestions or ideas are appreciated, thanks.
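One factor I suspect is relevant: the pipeline-bubble analysis in the Megatron-LM paper, where the bubble fraction for the 1F1B schedule is (p - 1) / m for p pipeline stages and m microbatches per batch. A minimal sketch (the function name is mine):

```python
def bubble_fraction(p: int, m: int) -> float:
    """Pipeline bubble overhead relative to ideal compute time,
    per the Megatron-LM paper: (p - 1) / m,
    where p = pipeline stages, m = microbatches per global batch."""
    return (p - 1) / m

# With 8 pipeline stages, few microbatches means a large bubble:
print(bubble_fraction(8, 8))   # -> 0.875
# Large gradient accumulation (many microbatches) shrinks it:
print(bubble_fraction(8, 64))  # -> 0.109375
```

This would be consistent with what I observed: pipeline parallelism only pulls ahead when gradient accumulation is large, since that is exactly the regime where the bubble becomes small.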
![image](https://user-images.githubusercontent.com/15059072/199716702-6269e827-32c1-465a-8edd-1096de33e874.png)