NVIDIA / nccl-tests

NCCL Tests

fix: nvls all reduce correction factor #239

Open OrenLeung opened 3 months ago

OrenLeung commented 3 months ago

I was running nccl-tests on a single H100 server (8x H100 SXM) and saw a Bus BW of 480 GB/s even though the line rate is 450 GB/s. I was confused and looked further into how bus BW is calculated, and it seems like it is calculated incorrectly for in-network reduction algorithms.

According to https://github.com/NVIDIA/nccl-tests/issues/212#issuecomment-2210390757, the actual correction factor should be bus_bw = algo_bw * (n-1)/(n+1) instead of bus_bw = algo_bw * 2(n-1)/n.

This PR is probably not mergeable, since NCCL_ALGO can be auto-picked or set in /etc/nccl.conf, and there doesn't seem to be an API for seeing which algo NCCL has chosen. The correction factors for CollnetDirect and CollnetChain on the IB network probably need to be updated too.

But I just wanted to put it here in case anyone else in the community is confused about how bus BW could be reported at ~106% of the peak theoretical line rate.

Command

```
NCCL_ALGO=NVLS ./build/all_reduce_perf -b 8K -e 8G -f 2 -g 8
```

Before

[image: results before the fix]

After

[image: results after the fix]

Factor vs number of ranks

[image: correction factor vs. number of ranks]

NVLS read/write

[image: NVLS read/write slide]

sjeaugey commented 2 months ago

Sorry, my comment was incorrect. I fixed it. It's algobw = busbw * n / (n+1).
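
For reference, a minimal sketch (not nccl-tests code; the function names and the example algbw value are illustrative only) contrasting the default 2(n-1)/n allreduce busbw factor with the corrected NVLS relation above:

```cpp
#include <cstdio>

// Default nccl-tests convention for allreduce busbw (point-to-point model):
//   busbw = algbw * 2*(n-1)/n
static double busbwDefault(double algbw, int nranks) {
  return algbw * 2.0 * (nranks - 1) / nranks;
}

// Relation quoted above for single-node NVLS allreduce:
//   algbw = busbw * n/(n+1)  <=>  busbw = algbw * (n+1)/n
static double busbwNvls(double algbw, int nranks) {
  return algbw * (nranks + 1.0) / nranks;
}

int main() {
  const int n = 8;             // 8x H100 SXM, as in the report above
  const double algbw = 274.3;  // ~480/1.75: the algbw implied by the reported
                               // 480 GB/s busbw under the default factor
  std::printf("default factor: %.1f GB/s busbw\n", busbwDefault(algbw, n));
  std::printf("NVLS relation:  %.1f GB/s busbw\n", busbwNvls(algbw, n));
  return 0;
}
```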

sjeaugey commented 2 months ago

Also note, the slide above is incorrect as well. It should read N-1 reads / N-1 writes in the left column, for a total of 2(N-1) sends and 2(N-1) receives.

OrenLeung commented 2 months ago

> Sorry, my comment was incorrect. I fixed it. It's algobw = busbw * n / (n+1).

@sjeaugey thanks for the clarification. For NVLSTree, what would the correction factor be?

sjeaugey commented 2 months ago

Well, this is where things get complicated. NVLSTree uses NVLS intra-node but Tree inter-node. Tree is near-bandwidth-optimal: it exchanges 2×size instead of 2×(n-1)/n×size, except for 2 nodes, in which case it only exchanges size. So now we have a mix of intra-node NVLS and inter-node Tree, and the performance will depend on whichever is the bottleneck. On 2 nodes it will be NVLS; on 4+ nodes it will be Tree.
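
As a rough summary of the per-rank inter-node traffic described above (a sketch using only the quantities stated in this comment, with $n$ the number of nodes and $S$ the message size):

$$
\text{Tree: } \begin{cases} S & n = 2 \\ 2S & n \ge 3 \end{cases}
\qquad\qquad
\text{Ring (bandwidth-optimal): } \frac{2(n-1)}{n}\,S
$$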

And things can get worse. When mixing intra-node and inter-node, part of the intra-node traffic may be lightened because the inter-node part plays the role of one of the intra-node steps, meaning things get really complicated to compute. That's the case for rings, which are limited to 370 GB/s intra-node, but when combining intra- and inter-node we only perform 7/8 of the steps, hence the network becomes the bottleneck at 395 GB/s. (That being said, since we limit ourselves to 16 SMs to limit SM and memory usage, we won't reach that peak BW with default settings.)
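
One way to read that arithmetic (a back-of-the-envelope check, assuming the 7/8-steps argument above and treating ~395 GB/s as the network's effective limit expressed as busbw):

$$
370\ \text{GB/s} \times \tfrac{8}{7} \approx 423\ \text{GB/s} > 395\ \text{GB/s},
$$

so once rings skip one of the eight steps on NVLink, the NVLink ceiling moves above the network's, and the network becomes the bottleneck.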

That's why trying to track how much bandwidth goes through each NVLink, PCI link, and network port is a very complex task, and not something you can easily reflect in a benchmark.

The notion of BusBW, as we defined it, is a theoretical correction factor based on what's needed to communicate between ranks when using point-to-point transfers. When using point-to-point communication, it gives a constant target as we scale instead of a degrading one, similarly to the broadcast operation, which has always had a natural notion of bandwidth.
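
Concretely, for allreduce that correction is the factor already mentioned in this thread:

$$
\text{busbw} = \text{algbw} \times \frac{2(n-1)}{n}, \qquad \frac{2(n-1)}{n} \xrightarrow{\ n\to\infty\ } 2,
$$

so for point-to-point algorithms the busbw target stays essentially constant as ranks are added, while the raw algbw degrades.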

But except in simple cases like rings on a flat, homogeneous topology, it does not really reflect the "bus" bandwidth (which isn't surprising, given there are many different buses). It may reflect some mix of speeds of the different buses, and in the case of accelerators like SHARP it doesn't mean much anymore, since the algo bandwidth is now what should be constant at scale. But when we combine SHARP with non-SHARP, if the non-SHARP part becomes the bottleneck, then it may make sense again.

So you can consider the "Bus BW" as another bandwidth computation, with a correction factor that makes more sense in some cases. We can still compare the BusBW of Ring vs NVLS to see how much faster one is versus the other. When NVLS gets 480 GB/s BusBW on 8 GPUs, it means that you would need 480 GB/s of NVLink bandwidth to get the same performance with a Ring or Alltoall algorithm.
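
To make that last statement concrete: under the default factor, 480 GB/s of busbw on 8 GPUs corresponds to an algbw of roughly 480 / 1.75 ≈ 274 GB/s; matching that with a Ring, where busbw roughly equals the traffic each GPU pushes over its NVLink, would indeed require about

$$
274\ \text{GB/s} \times \frac{2(8-1)}{8} \approx 480\ \text{GB/s}
$$

of NVLink bandwidth per GPU (a back-of-the-envelope reading, not a measurement).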

Hope that explains what the goal of the "BusBW" is, and why we don't try to improve the NCCL perf tests to reflect the real bandwidth of all buses.