NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

What is the logic for allocating data across different channels? #1299

Open jxh314 opened 3 months ago

jxh314 commented 3 months ago

While fine-tuning the OPT-1.3b model with DeepSpeed across two nodes, each with two RoCE network interfaces (mlx5_0 and mlx5_1), I configured NCCL to use the ring algorithm and inspected the RDMA packets on particular interfaces with tcpdump. My measurements show that the amount of data transmitted on each channel is approximately equal. Does this mean that data is distributed uniformly across the channels, or is it merely a coincidence?

Additionally, if possible, could you point me to the source code responsible for distributing data across the channels, so I can study it?

visualxu commented 2 months ago

Yes, the data is distributed evenly across the channels. The buffer size (buffSize) of each channel is controlled by the NCCL_BUFFSIZE environment variable: https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/env.html?highlight=nccl_buffsize
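
To make the even split concrete, here is a minimal host-side sketch; it is an illustration of the idea, not NCCL's implementation, and the payload size and channel count are made-up numbers. Each channel owns roughly totalBytes/nChannels bytes and pipelines its share through a buffer whose size NCCL_BUFFSIZE controls:

```cpp
// Illustrative sketch: dividing a payload evenly across channels, with each
// channel pipelining its share through a buffer of NCCL_BUFFSIZE bytes.
// This is NOT NCCL's actual code; the sizes below are made up for the example.
#include <cstdio>
#include <cstdlib>

int main() {
  // NCCL reads the per-channel buffer size from NCCL_BUFFSIZE (default 4 MiB).
  const char* env = std::getenv("NCCL_BUFFSIZE");
  const size_t buffSize = env ? std::strtoull(env, nullptr, 10) : (4ull << 20);

  const size_t totalBytes = 256ull << 20;  // e.g. a 256 MiB all-reduce payload
  const int nChannels = 4;                 // rings/channels in use (assumed)

  // Even split: each channel owns ~totalBytes/nChannels bytes and moves them
  // in chunks no larger than its buffer.
  const size_t perChannel = (totalBytes + nChannels - 1) / nChannels;
  const size_t chunksPerChannel = (perChannel + buffSize - 1) / buffSize;
  std::printf("buffSize=%zu B, %d channels, %zu B/channel, %zu chunk(s) each\n",
              buffSize, nChannels, perChannel, chunksPerChannel);
  return 0;
}
```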

jxh314 commented 2 months ago

Thank you for your response. May I ask which function or .cc file implements the even distribution of the total transmitted data across the channels? Could you kindly point it out if possible? @visualxu
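
While this question awaits a maintainer reply, the relevant logic appears to live in two places: the host side computes channel counts and chunk sizes during enqueue (src/enqueue.cc), and the device-side collective loops (src/collectives/device/all_reduce.h in older releases, src/device/all_reduce.h in newer ones) stride each channel through the user buffer so that channels cover disjoint stripes of equal total size. The sketch below models that striding pattern; identifiers such as bid, chunkSize, and loopSize are simplified stand-ins, not NCCL's exact code:

```cpp
// Toy model of the per-channel striding in a ring all-reduce device loop.
// Each channel (one CUDA block, index bid) takes the slice at
// gridOffset + bid*nranks*chunkSize in every loop iteration, so the channels
// end up covering disjoint stripes of equal total size. Simplified sketch,
// not NCCL's actual code.
#include <cstdio>

int main() {
  const long count     = 64;  // toy element count
  const int  nranks    = 2;   // ranks in the ring
  const int  nChannels = 2;   // channels, one CUDA block each
  const long chunkSize = 8;   // elements each rank handles per step

  const long loopSize = (long)nChannels * nranks * chunkSize;
  for (int bid = 0; bid < nChannels; bid++) {
    std::printf("channel %d covers:", bid);
    for (long gridOffset = 0; gridOffset < count; gridOffset += loopSize) {
      const long offset = gridOffset + (long)bid * nranks * chunkSize;
      if (offset < count)
        std::printf(" [%ld, %ld)", offset, offset + nranks * chunkSize);
    }
    std::printf("\n");
  }
  return 0;
}
```

Running this prints disjoint, equal-sized ranges per channel (e.g. channel 0 covers [0, 16) and [32, 48), channel 1 covers [16, 32) and [48, 64)), which matches the roughly equal per-interface traffic observed with tcpdump above.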