NVIDIA / nccl-tests

NCCL Tests

`busbw` does not reflect the speed of hardware bottleneck in H800 #153

Open · zhangmenghao opened this issue 1 year ago

zhangmenghao commented 1 year ago

@sjeaugey Hi Sylvain, I have found that since the introduction of NVLS in NCCL 2.17.1 and 2.18.1, `busbw` no longer reflects the hardware bottleneck well on H800. For example, on two H800 servers, each equipped with 8 H800 GPUs and 8 100Gbps Mellanox CX-6 NICs, `busbw` can reach as high as 161Gbps. This is because NCCL uses the NVLS algorithm and performs some kind of intra-server aggregation.

[screenshot of nccl-tests output omitted]

As a result, from my perspective, the `double factor = ((double)(2*(nranks - 1)))/((double)nranks);` in `nccl-tests/src/all_reduce.cu` should be set based on the algorithm the all-reduce selects. For Tree and Ring, as you explain in `nccl-tests/doc/PERFORMANCE.md`, the factor should be 2(n-1)/n. For NVLS or COLLNET, however, the factor should be revised.

What do you think about this?
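For reference, here is a simplified sketch (not the exact nccl-tests code; variable names and the sample numbers are made up) of how the test derives `algbw` and `busbw` for all-reduce, using the factor quoted above:

```cpp
#include <cstdio>

// Simplified sketch of the algbw/busbw computation in nccl-tests for AllReduce.
// The real logic lives in src/all_reduce.cu and src/common.cu; the numbers
// below are purely illustrative.
int main() {
    const double nranks = 16;      // e.g. 2 nodes x 8 GPUs, as in this issue
    const double bytes  = 1e9;     // message size S (hypothetical)
    const double sec    = 0.0125;  // measured time per operation (hypothetical)

    double algbw  = (bytes / 1e9) / sec;          // GB/s seen by the application
    double factor = 2.0 * (nranks - 1) / nranks;  // flat-network p2p ratio, ~1.875 here
    double busbw  = algbw * factor;               // what the test reports as busbw

    printf("algbw = %.2f GB/s, busbw = %.2f GB/s (factor = %.3f)\n",
           algbw, busbw, factor);
    return 0;
}
```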

sjeaugey commented 1 year ago

It's not new. The notion of BusBw was invented to compensate for the fact that the amount of data transmitted by an allreduce implementation based on point-to-point communication is not constant with the number of GPUs. So it should be understood as the bandwidth going through the wires on a flat system where allreduce is implemented with point-to-point operations.

Now, systems are not flat (they are hierarchical, with NVLink + IB), not all algorithms transmit exactly 2(N-1)/N × S per rank (Tree, for example, transmits 2S), and some algorithms now use network-accelerated techniques (SHARP) which break the "point-to-point based" rule.

So indeed, with SHARP enabled and depending on where the bottleneck is, AlgBW may make more sense for interpreting the numbers, and BusBW may not represent any "Bus" bandwidth.

The notion of BusBw, even though the "Bus" in the name is no longer very accurate, can still be used to compare a system against another, simpler system with no acceleration. It is just AlgBw with a ratio, after all.
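To make that ratio concrete with the numbers from this issue (a back-of-the-envelope sketch: nccl-tests prints bandwidth in GB/s, so the 161 above is presumably 161 GB/s, and the assumption here is that the 8×100Gbps NICs per node are the inter-node bottleneck):

```cpp
#include <cstdio>

int main() {
    // Figures taken from this issue; the interpretation is illustrative only.
    const double nranks      = 16;                          // 2 nodes x 8 GPUs
    const double factor      = 2.0 * (nranks - 1) / nranks; // 1.875
    const double node_nic_bw = 8 * (100.0 / 8.0);           // 8 NICs x 100 Gbps ~= 100 GB/s per node
    const double busbw       = 161.0;                       // reported by nccl-tests (GB/s)

    // busbw is just algbw scaled by the flat-network factor:
    double algbw = busbw / factor;                          // ~86 GB/s

    // With a purely point-to-point algorithm (and NVLink not the bottleneck),
    // busbw could not meaningfully exceed the ~100 GB/s of per-node NIC
    // bandwidth; 161 GB/s is only possible because NVLS/SHARP aggregates data
    // before it crosses the network.
    printf("algbw = %.1f GB/s, busbw = %.1f GB/s, per-node NIC bw = %.1f GB/s\n",
           algbw, busbw, node_nic_bw);
    return 0;
}
```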

zhangmenghao commented 1 year ago

Got it, thank you very much!

zhangmenghao commented 1 year ago

Anyway, I still believe it would be much better if there were some metrics that reflected the speed of the hardware bottleneck (usually the network, PCIe, or QPI), regardless of the concrete collective communication algorithm NCCL implements. That way, users could use these metrics to determine whether the setup (e.g., network settings, PCIe parameters, NCCL parameters) is correct.

Do you have any plans to introduce such metrics?

zhangmenghao commented 1 year ago

One more question: is there any way to find out which collective communication algorithm NCCL chose for a specific collective operation, e.g., whether Tree, Ring, NVLS, or a combination of them was used for a given all-reduce? Does NCCL output any logs about this choice?

sjeaugey commented 1 year ago

> it would be much better if there were some metrics that reflected the speed of the hardware bottleneck

Agreed, but that is not something the NCCL perf tests can do. BusBw is, in a way, universal: it is based on the minimal amount of traffic of a p2p-based algorithm, and it applies to ring, direct, or even recursive-doubling algorithms (all algorithms which are bandwidth-optimal).
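For reference, the standard lower-bound argument behind that factor can be sketched as follows (not quoted from the NCCL docs): each rank holds S bytes, and a bandwidth-optimal allreduce on a flat network (e.g. reduce-scatter followed by allgather) makes every rank send and receive (n-1)/n of the data in each phase:

```latex
\text{per-rank traffic} \;\ge\;
\underbrace{\tfrac{n-1}{n}\,S}_{\text{reduce-scatter}}
\;+\;
\underbrace{\tfrac{n-1}{n}\,S}_{\text{allgather}}
\;=\; \tfrac{2(n-1)}{n}\,S
```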

With hierarchical topologies and the multiplication of algorithms, things become more complicated -- it's unfortunate but also tied to the fact that things move quickly.

> is there any way to find out which collective communication algorithm NCCL chose for a specific collective operation

There is a printf you can uncomment here, or a TRACE here if you compile with tracing enabled. Of course, this is very verbose, so it's not something you'd want to enable in production.

zhangmenghao commented 1 year ago

@sjeaugey Hi Sylvain, in the reply above you said that the amount of data transmitted by one GPU in the Tree-based all-reduce algorithm is 2S (except for the root, which transmits just S). If so, is there something wrong in `nccl-tests/doc/PERFORMANCE.md`? For the Tree-based all-reduce algorithm, should the algbw/busbw factor be 2 instead of 2(n-1)/n?

I am looking forward to hearing from you. Thank you very much!

sjeaugey commented 1 year ago

The formula for the busbw is based on the minimal theoretical amount of data that needs to be transmitted between ranks on a flat network. Ring, among others (all-to-all, recursive doubling, ...), transmits exactly that amount, so the busbw does reflect the real link bandwidth (in most cases).

Tree does not, although at large scale it comes close (2(n-1)/n converges to 2). So, because the Tree algorithm is not exactly optimal for allreduce, the busbw does not reflect the real bandwidth on the links (you should take algbw*2 instead), but it is still useful as a comparison against the theoretical peak of a perfect algorithm on a flat network.
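To illustrate the gap (a small sketch with a normalized message size; the formulas are the ones discussed above):

```cpp
#include <cstdio>

// Per-rank traffic of a flat ring allreduce (what busbw assumes) vs. a tree
// allreduce (2S for non-root ranks), for a normalized message size S = 1.
int main() {
    const double S = 1.0;
    const int sizes[] = {2, 8, 16, 128, 1024};
    for (int n : sizes) {
        double ring_traffic = 2.0 * (n - 1) / n * S;  // factor used by busbw
        double tree_traffic = 2.0 * S;                // what Tree actually sends
        printf("n=%5d  ring: %.4f*S  tree: %.4f*S  tree/ring: %.4f\n",
               n, ring_traffic, tree_traffic, tree_traffic / ring_traffic);
    }
    // As n grows, 2(n-1)/n -> 2, so busbw and the real Tree link bandwidth
    // (algbw*2) converge at large scale.
    return 0;
}
```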

I agree that "busbw" deserves a better name, something like "bandwidth rectified to account for the minimal amount of data that needs to be transmitted between ranks using point-to-point communication on a flat network". In simple cases, with optimal algorithms, it does reflect the bus bandwidth, which is why it was named that way initially.