The cluster contains 14 nodes with 112 A800 GPUs in total (8 per node), and the network topology is as follows: ![image](https://github.com/NVIDIA/nccl-tests/assets/1552048/4a8a5e1e-9e40-411e-9f16-007f83789e2b)
Running `alltoall_perf` on all 14 nodes with PXN off (by setting `NCCL_P2P_PXN_LEVEL=0`), the bandwidth is ~9 GB/s.
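For reference, a representative invocation (the hostfile name, binary path, and size sweep below are placeholders for our actual setup; we launch one rank per GPU):

```bash
# PXN disabled; 112 ranks across 14 nodes, one rank per GPU.
# hosts_14 and the binary path are placeholders.
mpirun -np 112 -hostfile hosts_14 \
  -x NCCL_P2P_PXN_LEVEL=0 \
  ./build/alltoall_perf -b 8 -e 8G -f 2 -g 1
```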
With PXN on, running `all_reduce_perf` on all 14 nodes gives ~95 GB/s, but `alltoall_perf` hangs; while it is hung, our network engineers observed that the NCCL "Progress threads" are idle.
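The hung process was inspected roughly like this, by dumping the backtraces of every thread of one stuck rank (`<pid>` is a placeholder; this assumes gdb is available in the container):

```bash
# Attach to a hung alltoall_perf rank and dump all thread stacks,
# to check whether the NCCL progress threads are actively polling.
gdb -p <pid> -batch -ex "thread apply all bt"
```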
We have run `alltoall_perf` successfully across different node pairings and even across several 8-node compositions, with PXN both on and off. In the 8-node case we get ~12 GB/s with PXN on and ~10 GB/s with PXN off.
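The subset runs were driven roughly like this (the hostfile names stand in for the node compositions we tried; we assume `NCCL_P2P_PXN_LEVEL=2`, the default, corresponds to PXN on):

```bash
# Repeat the alltoall test over several 8-node hostfiles, PXN off (0) and on (2).
for hosts in hosts_8_a hosts_8_b hosts_8_c; do
  for pxn in 0 2; do
    mpirun -np 64 -hostfile "$hosts" \
      -x NCCL_P2P_PXN_LEVEL=$pxn \
      ./build/alltoall_perf -b 8 -e 8G -f 2 -g 1
  done
done
```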
Could you help us analyze why `alltoall_perf` hangs when run on all 14 nodes, or suggest how we can keep debugging this problem?

Docker image: nvcr.io/nvidia/pytorch:23.11-py3