Low bandwidth of AllReduce over a long-range connection with high latency (0.25ms)

The bandwidth of collective communication (CC) operations such as AllReduce is quite low when the 2 GPU nodes sit on opposite ends of a long-range connection (50km cable) with about 0.25ms latency.

I know NCCL uses the RDMA write operation in its code. RDMA write appears to be quite sensitive to high latency. Here is the test result of ib_write_bw (perftest) with a 400Gbps RNIC across the long-range connection. The bandwidth is much lower than expected (392Gbps).
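As a rough sanity check on why latency could matter here, the bandwidth-delay product gives the amount of data that has to be in flight to keep the link full. The sketch below is only an estimate under stated assumptions: it treats the 0.25ms figure as one-way latency (so the round trip is about 0.5ms), and the in-flight window in the example (128 outstanding messages of 64KiB) is a hypothetical value for illustration, not a measured setting. Whether this limit is what actually caps ib_write_bw or NCCL in this setup is not confirmed.

```python
def bdp_bytes(link_gbps: float, rtt_ms: float) -> float:
    """Bytes that must be in flight to saturate the link (bandwidth x RTT)."""
    return link_gbps * 1e9 / 8 * rtt_ms * 1e-3


def window_limited_gbps(inflight_bytes: float, rtt_ms: float, link_gbps: float) -> float:
    """Throughput ceiling when at most `inflight_bytes` can be unacknowledged."""
    ceiling = inflight_bytes * 8 / (rtt_ms * 1e-3) / 1e9
    return min(link_gbps, ceiling)


if __name__ == "__main__":
    LINK_GBPS = 400.0      # RNIC line rate from the post
    RTT_MS = 2 * 0.25      # assumption: 0.25 ms is one-way, so RTT is ~0.5 ms

    # Hypothetical window: 128 outstanding writes of 64 KiB each.
    inflight = 128 * 64 * 1024

    print(f"BDP: {bdp_bytes(LINK_GBPS, RTT_MS) / 1e6:.1f} MB")
    print(f"Ceiling with {inflight / 2**20:.0f} MiB in flight: "
          f"{window_limited_gbps(inflight, RTT_MS, LINK_GBPS):.1f} Gbps")
```

If the real in-flight window is smaller than the bandwidth-delay product, throughput would be expected to scale roughly with the window size, which may be worth checking by varying the message size and queue depth in perftest.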

When I test AllReduce with 2 ranks, the result appears to match ib_write_bw.

But when I test the RDMA send operation with perftest, the bandwidth looks fine.

Any ideas would be highly appreciated. Many thanks.

NVIDIA / nccl

Low bandwidth of AllReduce over a long-range connection with high latency (0.25ms) #1378