NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

Low bandwidth of AllReduce over long-range connection with high latency (0.25ms) #1378

Open yanminjia opened 4 months ago

yanminjia commented 4 months ago

The bandwidth of collective communication (CC) operations such as AllReduce is very low when the 2 GPU nodes sit on opposite sides of a long-range connection (50 km cable) with about 0.25 ms latency.

I know that NCCL uses the RDMA write operation, and RDMA write appears to be quite sensitive to high latency. Here is the result of ib_write_bw (perftest) with a 400 Gbps RNIC across the long-range connection. The bandwidth is much lower than expected (392 Gbps).

image
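
For context, a rough bandwidth-delay product (BDP) estimate shows how much data has to be in flight to keep such a link full. This is only a sketch under stated assumptions: it treats the 0.25 ms as one-way delay (so the round trip is 0.5 ms), and the in-flight sizes in the loop are hypothetical illustration values, not measured NCCL or perftest settings. Whether a limited in-flight window is actually the cause here is not confirmed.

```python
# Back-of-the-envelope model: throughput is capped by the data in flight per round trip.
# Assumption: 0.25 ms is the one-way delay, so RTT = 0.5 ms.

def bdp_bytes(link_gbps: float, rtt_s: float) -> float:
    """Bytes that must be in flight to saturate the link (bandwidth-delay product)."""
    return link_gbps * 1e9 * rtt_s / 8


def window_limited_gbps(inflight_bytes: float, rtt_s: float, link_gbps: float) -> float:
    """Achievable rate in Gbps when at most `inflight_bytes` are unacknowledged per RTT."""
    return min(link_gbps, inflight_bytes * 8 / rtt_s / 1e9)


ONE_WAY_S = 0.25e-3      # latency quoted in this issue
RTT_S = 2 * ONE_WAY_S    # assumed round-trip time
LINK_GBPS = 400.0        # RNIC line rate

if __name__ == "__main__":
    print(f"BDP: {bdp_bytes(LINK_GBPS, RTT_S) / 1e6:.1f} MB")
    # Hypothetical in-flight windows, for illustration only.
    for mib in (1, 4, 16, 64):
        rate = window_limited_gbps(mib * 2**20, RTT_S, LINK_GBPS)
        print(f"{mib:>3} MiB in flight -> {rate:6.1f} Gbps")
```

Under these assumptions the link needs roughly 25 MB in flight to run at line rate, so any sender that keeps less than that outstanding would be latency-bound rather than link-bound.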

When I test AllReduce with 2 ranks, the result appears to match ib_write_bw.

image

But when I test the RDMA send operation with perftest, the bandwidth looks fine.

image

Any ideas would be highly appreciated. Many thanks.