A billion is 10^9, so a 1.5B-parameter model has 1.5*(10^9) parameters, not 1.5*(10^8).
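Redoing the estimate below with the corrected count gives 3*1.5*(10^9)*16bit/2/(8.74*(10^9)) ≈ 4.1s of communication per iteration rather than 0.41s, which accounts for most of the observed per-iteration gap (8.68s vs 2.46s).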
DeepSpeed ZeRO Stage 3 or FSDP-style distributed training places very high demands on the bandwidth between cards. Since the A10 has 24GB of memory, you can get much better throughput by just using plain DDP.
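For example, a minimal DDP sketch (assuming a torchrun launch and the Hugging Face Qwen/Qwen2-1.5B checkpoint; the training loop itself is omitted):

```python
# Minimal DDP setup: gradients are synchronized with a single all-reduce,
# with no per-layer ZeRO-3 parameter all-gathers on the critical path.
# Launch on one node with: torchrun --nproc_per_node=2 train_ddp.py
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from transformers import AutoModelForCausalLM

dist.init_process_group(backend="nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

# Qwen2-1.5B in bf16 is roughly 3 GB of weights, so it fits on a
# 24 GB A10 without any parameter sharding.
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2-1.5B", torch_dtype=torch.bfloat16
).to(local_rank)
model = DDP(model, device_ids=[local_rank])
# ... regular training loop ...
```

With DDP, the only inter-node traffic is the gradient all-reduce, which PyTorch overlaps with the backward pass, whereas ZeRO-3 puts parameter all-gathers on the critical path of every forward and backward.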
@jklj077 Thanks, man. Silly me, I can't believe I got that wrong...
I used DeepSpeed ZeRO Stage 3 to fine-tune the Qwen2-1.5B model and observed that training on 2 nodes with 4 A10 GPUs in total is roughly 3.5x slower than on a single node with 2 A10 GPUs. Here are some details.
The training speed with 2 nodes (4 A10 GPUs in total):
about 8.68s/iter, with forward and backward latencies of 1.6s and 2.51s, respectively.
However, the training speed with a single node (2 A10 GPUs in total):
about 2.46s/iter, with forward and backward latencies of 357ms and 673ms, respectively.
The above results show that 2-node/4-GPU training is much slower than single-node training in both the forward and backward passes. I suspected a network bandwidth problem, but my calculation suggested otherwise, as follows:
The average receive and send bandwidths during training were 8.74Gbit/s and 9.28Gbit/s, respectively. Model weight size: 1.5*(10^8)*16bit; gradient size: 1.5*(10^8)*16bit; so the communication cost comes to 3*1.5*(10^8)*16bit/2/(8.74*(10^9)) ≈ 0.41s.
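For reference, the same estimate as a few lines of Python (a sketch; it reads the factor 3 as the forward parameter all-gather, backward parameter all-gather, and gradient reduce-scatter, and the /2 as half the traffic crossing the inter-node link):

```python
# Back-of-envelope ZeRO-3 communication estimate per iteration.
# Factor 3: parameter all-gather in forward, parameter all-gather in
# backward, and gradient reduce-scatter (assumed interpretation);
# /2 assumes half the traffic crosses the inter-node link.
n_params = 1.5e8           # parameter count used in the estimate above
bits_per_param = 16        # fp16/bf16
rx_bandwidth_bps = 8.74e9  # measured receive bandwidth, bit/s
comm_s = 3 * n_params * bits_per_param / 2 / rx_bandwidth_bps
print(f"{comm_s:.2f} s")   # -> 0.41 s
```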
So I would like to know what is wrong with these results. Any help would be appreciated.
Thanks!