NVIDIA / nccl-tests

NCCL Tests

NCCL alltoall_perf hangs via PXN #187

Closed gavin1332 closed 7 months ago

gavin1332 commented 7 months ago

The cluster contains 14 nodes with 112 A800 GPUs in total, and the network topology is as follows:
[network topology image]

With PXN off (NCCL_P2P_PXN_LEVEL=0), running alltoall_perf on all 14 nodes gives ~9 GB/s of bandwidth.
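For reference, a minimal invocation sketch of how such a run might look; the hostfile name, process count, and the message-size sweep below are assumptions for illustration, not details taken from the report:

```sh
# Sketch: run alltoall_perf across 14 nodes x 8 GPUs with PXN disabled.
# hosts.txt, -np 112, and the -b/-e/-f size range are illustrative assumptions.
mpirun -np 112 --hostfile hosts.txt \
    -x NCCL_P2P_PXN_LEVEL=0 \
    ./build/alltoall_perf -b 8 -e 8G -f 2 -g 1
```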

With PXN on, all_reduce_perf on all 14 nodes reaches ~95 GB/s, but alltoall_perf hangs; during the hang our network engineers observed that the NCCL progress threads are idle.
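One way to gather more information when the hang reproduces is to enable NCCL's debug logging before launching the test. NCCL_DEBUG, NCCL_DEBUG_SUBSYS, and NCCL_DEBUG_FILE are standard NCCL environment variables; the particular subsystem selection and log-file pattern below are only a suggestion:

```sh
# Sketch: turn on verbose NCCL logging for the hanging run.
# The subsystem list and per-host/per-PID log pattern are assumptions.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,NET,P2P
export NCCL_DEBUG_FILE=/tmp/nccl.%h.%p.log   # one log file per host and process
```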

We have run alltoall_perf successfully on various node pairings and even on several 8-node combinations, with PXN both on and off. With 8 nodes we get ~12 GB/s with PXN on and ~10 GB/s with PXN off.

Could you help us analyze why alltoall_perf hangs when run across all 14 nodes, or suggest how we might debug the problem further?

Docker image: nvcr.io/nvidia/pytorch:23.11-py3

gavin1332 commented 7 months ago

The bug was found in the NCCL cuMem API.
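For anyone hitting a similar hang: since this resolution points at the cuMem allocation path, one workaround to try (an assumption based on this comment, not a fix confirmed in the thread) is to disable NCCL's use of the cuMem API via the documented NCCL_CUMEM_ENABLE variable:

```sh
# Sketch: fall back to the legacy allocation path instead of the cuMem* API.
# Whether this avoids the hang on this cluster is an assumption, not verified here.
export NCCL_CUMEM_ENABLE=0
```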