algorithm bandwidth of all2all

de1star commented 1 year ago

Hi, thanks for your great help that solved my problem in another issue.
I'd like to calculate the algorithm bandwidth of all2all on my cluster, but I found that https://github.com/NVIDIA/nccl-tests/blob/master/doc/PERFORMANCE.md does not mention it. May I ask how to calculate it, given the bandwidth and number of IB NICs?

sjeaugey commented 1 year ago

The alltoall bus bandwidth (busBw) is simply computed by multiplying the algorithm bandwidth (algBw) by (n-1)/n, where n is the number of ranks, since 1/n of the data is local and (n-1)/n is remote.
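As a minimal sketch of that conversion (the function name `alltoall_bus_bw` is made up for illustration and is not part of nccl-tests; units are whatever algBw is reported in, e.g. GB/s):

```python
def alltoall_bus_bw(alg_bw: float, n_ranks: int) -> float:
    """Convert alltoall algorithm bandwidth to bus bandwidth.

    Hypothetical helper: applies the (n-1)/n factor, where n is the
    number of ranks, because 1/n of each rank's data stays local.
    """
    if n_ranks < 1:
        raise ValueError("n_ranks must be at least 1")
    return alg_bw * (n_ranks - 1) / n_ranks


if __name__ == "__main__":
    # Example values are arbitrary, chosen only to show the scaling.
    for n in (2, 8, 16):
        print(n, alltoall_bus_bw(100.0, n))
```

With a single rank all data is local, so the factor (and the resulting busBw) is 0.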

If you have one GPU per node the BusBw should be the NIC BW.

On a system with both NVLink and NICs, a portion of the traffic will be local (and should not be the bottleneck); the portion that goes through the network will determine the overall time, hence the reported bandwidth.

On 2 nodes, 50% of the traffic is inter-node, so you should see busBw = 2x the network bandwidth per GPU. As the number of nodes increases, it will go down toward 1x the network bandwidth per GPU (the general formula is N/(N-1) x the network bandwidth per GPU, N being the number of nodes, since (N-1)/N of the traffic is inter-node).
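The N/(N-1) rule above can be sketched as follows. This is only an illustration under the assumptions stated in this comment (the network is the bottleneck and intra-node traffic is effectively free); the function name and the 25 GB/s figure are hypothetical.

```python
def expected_alltoall_bus_bw(net_bw_per_gpu: float, n_nodes: int) -> float:
    """Estimate the ideal alltoall busBw when the network is the bottleneck.

    Hypothetical helper: only (N-1)/N of the traffic crosses the network,
    so the reported bandwidth is N/(N-1) times the per-GPU network bandwidth.
    """
    if n_nodes < 2:
        raise ValueError("the formula only applies to 2 or more nodes")
    return net_bw_per_gpu * n_nodes / (n_nodes - 1)


if __name__ == "__main__":
    # Assumed example: 25 GB/s of network bandwidth per GPU.
    for nodes in (2, 5, 64):
        print(nodes, expected_alltoall_bus_bw(25.0, nodes))
```

For 2 nodes this gives twice the per-GPU network bandwidth, and the result approaches 1x as the node count grows.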

de1star commented 1 year ago

Thanks for your reply! @sjeaugey

renwuli commented 1 month ago

@sjeaugey

Thanks for your explanation.

If we have M GPUs per node, and each GPU is connected to one NIC, what should the theoretical/ideal busbw be for a single node and for multiple nodes?

NVIDIA / nccl-tests

algorithm bandwidth of all2all #130