The bandwidth of collective communication (CC) such as AllReduce is very low when the two GPU nodes sit on opposite ends of a long-range connection (a 50 km cable) with about 0.25 ms latency.
I know the RDMA WRITE operation is used in the NCCL code, and it appears to be quite sensitive to high latency. Here is the result of ib_write_bw (perftest) with a 400 Gbps RNIC across the long-range connection: the measured bandwidth is much lower than the expected 392 Gbps.
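As a rough sanity check on why the latency matters, here is a bandwidth-delay-product estimate. This is only a sketch: it assumes the 0.25 ms is one-way (so the RTT is about 0.5 ms) and uses perftest's default 64 KB message size and tx-depth of 128 for illustration; a window-limited sender cannot exceed its outstanding data divided by the RTT, no matter how fast the link is.

```python
# Rough bandwidth-delay-product (BDP) sketch for a 400 Gbps link with
# ~0.25 ms one-way latency (RTT ~0.5 ms). All numbers are illustrative.

LINK_GBPS = 400.0               # assumed RNIC line rate
RTT_S = 2 * 0.25e-3             # assumed round-trip time

# Data that must be in flight to keep the pipe full.
bdp_bytes = LINK_GBPS * 1e9 / 8 * RTT_S
print(f"BDP: {bdp_bytes / 2**20:.1f} MiB")                 # ~23.8 MiB

# If the sender only keeps tx_depth messages of msg_size bytes outstanding
# (e.g. ib_write_bw defaults), throughput is capped at outstanding / RTT.
msg_size = 64 * 1024            # assumed -s value
tx_depth = 128                  # assumed -t / --tx-depth value
outstanding = msg_size * tx_depth
capped_gbps = min(LINK_GBPS, outstanding * 8 / RTT_S / 1e9)
print(f"Outstanding: {outstanding / 2**20:.1f} MiB -> max ~{capped_gbps:.0f} Gbps")
```

If that reasoning applies here, raising the message size (`-s`), the send queue depth (`-t`/`--tx-depth`), or the number of QPs (`-q`) in ib_write_bw should move the measured number, which would suggest the link is window-limited rather than faulty.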
When I test AllReduce with 2 ranks, the result appears to match ib_write_bw.
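For completeness, below is a sketch of the kind of 2-rank nccl-tests run I mean, with two NCCL knobs that might raise the per-connection in-flight data over a link with this much bandwidth-delay product. The hostnames, binary path, and tuning values are placeholders and assumptions, not a confirmed fix.

```python
# Sketch: a 2-rank all_reduce_perf (nccl-tests) run across the two nodes
# that also raises NCCL's per-connection in-flight data. Hostnames, the
# binary path, and the tuning values are placeholders.
import os
import subprocess

env = dict(os.environ)
env.update({
    "NCCL_DEBUG": "INFO",
    # Larger intermediate buffers per channel (NCCL default is 4 MiB).
    "NCCL_BUFFSIZE": str(16 * 1024 * 1024),
    # Spread each connection across several IB queue pairs.
    "NCCL_IB_QPS_PER_CONNECTION": "4",
})

cmd = [
    "mpirun", "-np", "2", "-H", "nodeA,nodeB",        # placeholder hostnames
    "-x", "NCCL_DEBUG",
    "-x", "NCCL_BUFFSIZE",
    "-x", "NCCL_IB_QPS_PER_CONNECTION",
    "./build/all_reduce_perf",                        # assumed nccl-tests path
    "-b", "8", "-e", "1G", "-f", "2", "-g", "1",
]
subprocess.run(cmd, env=env, check=True)
```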
But when I test the RDMA SEND operation with perftest, it looks fine.
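Since ib_write_bw and ib_send_bw share most perftest options, one way to narrow this down is to run both with identical message size, queue depth, and QP count so the only difference is the opcode. The sketch below drives only the client side and assumes a matching server instance (the same command without the hostname) is already running on the far node; the device name and hostname are placeholders.

```python
# Sketch: run the client side of ib_write_bw and ib_send_bw with identical
# parameters so that only the RDMA opcode differs. Assumes perftest is
# installed and a matching server is already listening on the remote node.
import subprocess

SERVER = "nodeB"              # placeholder: remote perftest server
COMMON = [
    "-d", "mlx5_0",           # placeholder RNIC device
    "-s", "1048576",          # 1 MiB messages
    "-q", "4",                # 4 QPs
    "-t", "512",              # deeper send queue
    "-D", "10",               # run each test for 10 seconds
    "-F", "--report_gbits",
]

for test in ("ib_write_bw", "ib_send_bw"):
    print(f"--- {test} ---")
    subprocess.run([test, *COMMON, SERVER], check=True)
```

If SEND holds line rate while WRITE stays low at identical settings, that would point at something opcode-specific rather than the cable itself.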
Any ideas would be highly appreciated. Many thanks.