Hello, I have compared the training speed of tensor parallelism versus pipeline parallelism in Megatron on a DGX A100 node.
I find that when the micro-batch size and gradient-accumulation steps are large enough, pure pipeline parallelism runs faster than
pure tensor parallelism, with over a 30% speedup. But in the paper "Efficient Large-Scale Language Model Training on GPU Clusters
Using Megatron-LM", the conclusion is that within a server node, the best practice is to use only tensor parallelism to achieve the highest
TFLOPS.
What I'm curious about is: does higher TFLOPS equal faster training speed? Since tensor parallelism spends time on all_reduce and other
communication steps, it naturally seems to have more overhead per step than pipeline parallelism.
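To make the question concrete, here is how I understand the relationship between reported TFLOPS and wall-clock speed (a sketch of the usual accounting, not necessarily Megatron's exact formula; the function name and the example numbers are mine):

```python
def achieved_tflops_per_gpu(flops_per_iter: float, iter_time_s: float, n_gpus: int) -> float:
    """Achieved TFLOPS per GPU: useful model FLOPs for one iteration
    divided by (iteration wall-clock time * number of GPUs).

    Communication (all_reduce, point-to-point) adds to iter_time_s but
    contributes no useful FLOPs, so it lowers this number."""
    return flops_per_iter / (iter_time_s * n_gpus) / 1e12

# Hypothetical example: 1e15 model FLOPs per iteration, 2 s/iter, 8 GPUs.
print(achieved_tflops_per_gpu(1e15, 2.0, 8))  # -> 62.5
```

If the FLOPs in the numerator are fixed by the model and batch size, then for the same model, higher achieved TFLOPS should mean faster wall-clock training, which is why the paper's conclusion and my measurements seem to conflict.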
Here is my test environment:
- DGX A100 node with 8 GPUs.
- BERT models with 10 billion, 8 billion, and 1 billion parameters, configured in Megatron by adjusting the hyper-parameters.
- The comparison is between tensor-parallel size 8 and pipeline-parallel size 8.

Any suggestions or ideas are appreciated, thanks.
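One factor I suspect is relevant: the pipeline-bubble analysis in the Megatron-LM paper, where the bubble fraction for the 1F1B schedule is (p - 1) / m for p pipeline stages and m microbatches per batch. A minimal sketch (the function name is mine):

```python
def bubble_fraction(p: int, m: int) -> float:
    """Pipeline bubble overhead relative to ideal compute time,
    per the Megatron-LM paper: (p - 1) / m,
    where p = pipeline stages, m = microbatches per global batch."""
    return (p - 1) / m

# With 8 pipeline stages, few microbatches means a large bubble:
print(bubble_fraction(8, 8))   # -> 0.875
# Large gradient accumulation (many microbatches) shrinks it:
print(bubble_fraction(8, 64))  # -> 0.109375
```

This would be consistent with what I observed: pipeline parallelism only pulls ahead when gradient accumulation is large, since that is exactly the regime where the bubble becomes small.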
![image](https://user-images.githubusercontent.com/15059072/199716702-6269e827-32c1-465a-8edd-1096de33e874.png)