NVIDIA / nccl-tests

NCCL Tests

H100 all reduce performance is poor #212

Open liminn opened 3 months ago

liminn commented 3 months ago

We tested all_reduce_perf on H100 and the algbw is about 250 GB/s, but NVIDIA officially claims that the all-reduce bandwidth can reach 450 GB/s.

all_reduce_perf on H100, about 250 GB/s: WechatIMG19

NV official 450 GB/s: WechatIMG16

Why is the measured all-reduce bandwidth so much smaller than the bandwidth officially claimed by NVIDIA?

sjeaugey commented 3 months ago

450GB/s is the line rate. With the protocol overhead (~20%) NCCL can get an effective NVLink bandwidth of 370GB/s, which you can observe as "BusBw" when running non-allreduce operations, or when running with NCCL_ALGO=RING.

Now allreduce uses NVLink SHARP by default, which accelerates the allreduce operation. With NVLink SHARP, though, neither the algbw nor the busbw corresponds to the bandwidth on the wire. On 8 GPUs, you should reach a peak busBw of ~480GB/s (algbw ~275GB/s). You need to run on larger sizes though; 256M is too small to reach peak bandwidth.
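
For reference, here is a minimal sketch of how the two cases could be compared with nccl-tests on a single 8-GPU node (the binary path and the exact size range are assumptions, not taken from this thread):

```sh
# Sweep message sizes from 32 MB up to 8 GB (doubling each step) across 8 GPUs.
# By default on H100, NCCL uses NVLink SHARP (NVLS) for allreduce.
./build/all_reduce_perf -b 32M -e 8G -f 2 -g 8

# Force the ring algorithm to observe the ~370 GB/s effective NVLink busBw
# (450 GB/s line rate minus the NVLink protocol overhead).
NCCL_ALGO=Ring ./build/all_reduce_perf -b 32M -e 8G -f 2 -g 8
```

At the largest sizes, the first run should report a busBw around 480GB/s (algbw ~275GB/s) and the ring run around 370GB/s.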

liminn commented 3 months ago

@sjeaugey Thank you for your reply. I still have some questions: (1) The "450GB/s AllReduce BW" in the figure actually means busbw, not algbw, right? (2) NVIDIA claims that after using SHARP the bandwidth can be doubled, so why did you say "On 8 GPUs, you should reach a peak busBw of ~480GB/s (algbw ~275GB/s)"? I don't see SHARP doubling the bandwidth there.

Screenshot 2024-05-06 19 29 38

sjeaugey commented 3 months ago

Thanks for bringing this to my attention. The slides seem wrong; I'll try to get them fixed.

LearnigF commented 3 months ago

Hello. About "With the protocol overhead (~20%) NCCL can get an effective NVLink bandwidth of 370GB/s": what does "protocol" refer to here? Is it NCCL_PROTO?

sjeaugey commented 3 months ago

No, it's just the NVLink protocol (difference between wire speed and effective SM-level load/store speed through NVLink).

LearnigF commented 3 months ago

So, can I assume that these losses are inevitable and cannot be further optimized in software?

LearnigF commented 3 months ago

And can I interpret this ~20% performance loss as data encoding loss in the NVLink transport layer? Looking forward to your reply.

sjeaugey commented 3 months ago

Yes indeed.
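
As a rough sanity check of the numbers quoted above (the "~20%" is approximate): the effective 370GB/s corresponds to losing roughly 18% of the 450GB/s line rate to the NVLink protocol:

$$
450\ \text{GB/s} \times 0.82 \approx 370\ \text{GB/s}
$$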

LearnigF commented 2 months ago

Thank you very much. Is there any official documentation describing this performance loss at the transport layer? We had always assumed that 450GB/s is the real unidirectional bandwidth between cards.

hpettyiii commented 2 months ago

It says 2x EFFECTIVE bandwidth. SHARP operations cut in half the amount of data that needs to be sent for collective operations, so you only need to send half as much data; in other words, your bandwidth has EFFECTIVELY doubled.
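
A rough way to see where the 2x comes from, under the usual traffic accounting for ring allreduce versus in-switch reduction (this accounting is my own sketch, not stated in the thread): with a ring, each GPU sends and receives about 2(n-1)/n times the buffer size S, while with NVLS each GPU only sends its own S to the switch and receives the reduced S back:

$$
\text{Ring: } \frac{2(n-1)}{n}\,S \ \text{ per GPU per direction}, \qquad \text{NVLS: } \approx S \ \text{ per GPU per direction}
$$

For n = 8 that ratio is 1.75, which is roughly the "2x effective bandwidth" shown in the slide.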

QiuBiuBiu commented 2 months ago

@liminn Curious whether this performance data comes from the RING or the NVLS algorithm on H100. And what is the peak bandwidth (algo & bus) when the data size grows to 1GB or larger?

dearsxx0918 commented 1 month ago

Hi sjeaugey, I think the nccl-tests bus bandwidth is not correct for NVLS. The relationship between algbw and busbw for ring is algbw = busbw * n / (2 * (n - 1)); for NVLS it is algbw = busbw * n / (n - 1). Users care more about the algbw improvement, since that is what affects their E2E performance; busbw only reflects the effective bandwidth of the hardware links. So NV should clarify this, since it has caused a lot of misunderstanding about E2E performance.

Best regards, -Edda

sjeaugey commented 1 month ago

The NVLS formula would actually be algbw = busbw * (n+1) / (n-1).

But the NCCL perf tests do not look inside NCCL. They're sitting on top of NCCL, and as such they can't know what the topology is, nor which algorithm/mechanism NCCL is using.

The BusBw was added as a theoretical correction factor over the algorithm bandwidth, to account for the fact that some operations need to exchange more data than the buffer size when they are based on point-to-point exchanges. It also allowed us to have a bandwidth that does not change as we scale and that we can compare to HW values.

But with hierarchical or hardware-accelerated algorithms, it becomes less and less of a "bus" bandwidth and more of a "corrected algorithm bandwidth", and harder to interpret. For hardware-accelerated algorithms, the algorithm bandwidth is the actual bus bandwidth, and the "BusBw" no longer reflects any real bandwidth on the system.

It is still a useful value to compare algorithms against, though. Basically, when running with NVLS and getting 480GB/s, the BusBw tells you what NVLink bandwidth you would need to get the same performance if you didn't have NVLS. So you can see that number as the "bus bandwidth you would need if you connected all GPUs through a flat network and they could only communicate through Send/Recv operations".
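
Putting numbers on the 8-GPU case discussed above: nccl-tests reports busBw for allreduce as algbw scaled by 2(n-1)/n (see doc/PERFORMANCE.md in this repo), which makes the ~275GB/s algbw and ~480GB/s busBw quoted earlier consistent:

$$
\text{busBw} = \text{algBw} \times \frac{2(n-1)}{n} = 275\ \text{GB/s} \times \frac{2 \times 7}{8} \approx 481\ \text{GB/s}
$$

With NVLS, that 481GB/s does not actually flow over any single link; as explained above, it is the flat-network Send/Recv bandwidth you would need in order to match the same algbw.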