A billion is 10^9, so a 1.5B-parameter model has 1.5*(10^9) parameters, not 1.5*(10^8).
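Redoing the estimate below with the corrected count gives 3*1.5*(10^9)*16bit/2/(8.74*(10^9)) ≈ 4.1s of communication per iteration rather than 0.41s, which accounts for most of the observed per-iteration gap (8.68s vs 2.46s).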
DeepSpeed ZeRO Stage 3 or FSDP-style distributed training places very high demands on the bandwidth between cards. Since the A10 has 24GB of memory, you can get much better throughput by just using plain DDP.
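For example, a minimal DDP sketch (assuming a torchrun launch and the Hugging Face Qwen/Qwen2-1.5B checkpoint; the training loop itself is omitted):

```python
# Minimal DDP setup: gradients are synchronized with a single all-reduce,
# with no per-layer ZeRO-3 parameter all-gathers on the critical path.
# Launch on one node with: torchrun --nproc_per_node=2 train_ddp.py
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from transformers import AutoModelForCausalLM

dist.init_process_group(backend="nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

# Qwen2-1.5B in bf16 is roughly 3 GB of weights, so it fits on a
# 24 GB A10 without any parameter sharding.
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2-1.5B", torch_dtype=torch.bfloat16
).to(local_rank)
model = DDP(model, device_ids=[local_rank])
# ... regular training loop ...
```

With DDP, the only inter-node traffic is the gradient all-reduce, which PyTorch overlaps with the backward pass, whereas ZeRO-3 puts parameter all-gathers on the critical path of every forward and backward.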
@jklj077 Thanks, man. Silly me, I can't believe I got that wrong...
I used DeepSpeed ZeRO Stage 3 to fine-tune the Qwen2-1.5B model and observed that training on 2 nodes with 4 A10 GPUs in total is roughly 3.5x slower than on a single node with 2 A10 GPUs. Here are some details.
The training speed with 2 nodes (4 A10 GPUs in total):
about 8.68s/iter, with forward and backward latencies of 1.6s and 2.51s, respectively.
However, the training speed with a single node (2 A10 GPUs in total):
about 2.46s/iter, with forward and backward latencies of 357ms and 673ms, respectively.
The above results show that 2-node/4-GPU training is much slower than single-node training in both the forward and backward passes. I suspected a network bandwidth problem, but my calculation suggested otherwise, as follows:
The average receive and send bandwidths during training were 8.74Gbit/s and 9.28Gbit/s, respectively. Model weight size: 1.5*(10^8)*16bit; gradient size: 1.5*(10^8)*16bit; so the communication cost comes to 3*1.5*(10^8)*16bit/2/(8.74*(10^9)) ≈ 0.41s.
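For reference, the same estimate as a few lines of Python (a sketch; it reads the factor 3 as the forward parameter all-gather, backward parameter all-gather, and gradient reduce-scatter, and the /2 as half the traffic crossing the inter-node link):

```python
# Back-of-envelope ZeRO-3 communication estimate per iteration.
# Factor 3: parameter all-gather in forward, parameter all-gather in
# backward, and gradient reduce-scatter (assumed interpretation);
# /2 assumes half the traffic crosses the inter-node link.
n_params = 1.5e8           # parameter count used in the estimate above
bits_per_param = 16        # fp16/bf16
rx_bandwidth_bps = 8.74e9  # measured receive bandwidth, bit/s
comm_s = 3 * n_params * bits_per_param / 2 / rx_bandwidth_bps
print(f"{comm_s:.2f} s")   # -> 0.41 s
```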
So I would like to know what is wrong with these results. Any help would be appreciated.
Thanks!