The cluster contains 14 nodes with 112 A800 GPUs in total (8 per node), and the network topology is as follows: ![image](https://github.com/NVIDIA/nccl-tests/assets/1552048/4a8a5e1e-9e40-411e-9f16-007f83789e2b)
Running `alltoall_perf` on all 14 nodes with PXN off (by setting `NCCL_P2P_PXN_LEVEL=0`), the bandwidth is ~9 GB/s.
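For reference, a representative invocation (the hostfile name, binary path, and size sweep below are placeholders for our actual setup; we launch one rank per GPU):

```bash
# PXN disabled; 112 ranks across 14 nodes, one rank per GPU.
# hosts_14 and the binary path are placeholders.
mpirun -np 112 -hostfile hosts_14 \
  -x NCCL_P2P_PXN_LEVEL=0 \
  ./build/alltoall_perf -b 8 -e 8G -f 2 -g 1
```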
With PXN on, running `all_reduce_perf` on all 14 nodes gives ~95 GB/s, but `alltoall_perf` hangs; while it is hung, our network engineers observed that the NCCL "Progress threads" are idle.
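The hung process was inspected roughly like this, by dumping the backtraces of every thread of one stuck rank (`<pid>` is a placeholder; this assumes gdb is available in the container):

```bash
# Attach to a hung alltoall_perf rank and dump all thread stacks,
# to check whether the NCCL progress threads are actively polling.
gdb -p <pid> -batch -ex "thread apply all bt"
```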
We have run `alltoall_perf` successfully across different node pairings and even across several 8-node compositions, with PXN both on and off. In the 8-node case we get ~12 GB/s with PXN on and ~10 GB/s with PXN off.
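The subset runs were driven roughly like this (the hostfile names stand in for the node compositions we tried; we assume `NCCL_P2P_PXN_LEVEL=2`, the default, corresponds to PXN on):

```bash
# Repeat the alltoall test over several 8-node hostfiles, PXN off (0) and on (2).
for hosts in hosts_8_a hosts_8_b hosts_8_c; do
  for pxn in 0 2; do
    mpirun -np 64 -hostfile "$hosts" \
      -x NCCL_P2P_PXN_LEVEL=$pxn \
      ./build/alltoall_perf -b 8 -e 8G -f 2 -g 1
  done
done
```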
Could you help us analyze why `alltoall_perf` hangs when run on all 14 nodes, or suggest how we can keep debugging this problem?

Docker image: nvcr.io/nvidia/pytorch:23.11-py3