NVIDIA / nccl-tests

NCCL Tests

How to explain Bus Bandwidth in Allreduce Operation? #197

Closed HydraQYH closed 22 hours ago

HydraQYH commented 5 months ago

In https://github.com/NVIDIA/nccl-tests/blob/master/doc/PERFORMANCE.md, it is noted that "we need n-1 additions and n assignments for each element. Since every step is on a different rank except potentially one (the last input and the first output), we need 2(n-1) data transfers (x number of elements) to perform an allReduce operation."

Why are there 2(n-1) data transfers? The n-1 additions cause n-1 data transfers, and the n assignments also cause n-1 data transfers, because one of the n assignments happens within a single GPU (if we do the allreduce in place, that assignment does not even exist). The ring allreduce algorithm also requires 2n-2 data transfers per element.
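
To sanity-check the counting, here is a minimal sketch that simulates a ring allreduce in plain Python and counts how many times each element crosses a link. It is only an illustration under simple assumptions (one element per chunk, n chunks for n ranks); the name `ring_allreduce` is hypothetical and this is not how NCCL itself is implemented.

```python
def ring_allreduce(inputs):
    """Simulate a ring allreduce (sum) over n ranks.

    inputs[r] is rank r's buffer of n chunks (one number per chunk, an
    assumption made to keep the sketch small).
    Returns (buffers, sends), where sends[c] counts how many times
    chunk c was transferred between ranks.
    """
    n = len(inputs)
    data = [list(row) for row in inputs]  # per-rank working buffers
    sends = [0] * n

    # Phase 1, reduce-scatter: n-1 steps, each transfer is followed by an addition.
    for step in range(n - 1):
        # Snapshot all messages first so the sends within a step are simultaneous.
        msgs = [((r + 1) % n, (r - step) % n) for r in range(n)]
        vals = [data[r][(r - step) % n] for r in range(n)]
        for (dst, chunk), val in zip(msgs, vals):
            data[dst][chunk] += val
            sends[chunk] += 1
    # Rank r now holds the fully reduced chunk (r + 1) % n.

    # Phase 2, allgather: n-1 steps, each transfer is followed by an assignment.
    for step in range(n - 1):
        msgs = [((r + 1) % n, (r + 1 - step) % n) for r in range(n)]
        vals = [data[r][(r + 1 - step) % n] for r in range(n)]
        for (dst, chunk), val in zip(msgs, vals):
            data[dst][chunk] = val
            sends[chunk] += 1

    return data, sends


if __name__ == "__main__":
    n = 4
    inputs = [[10 * r + c for c in range(n)] for r in range(n)]
    buffers, sends = ring_allreduce(inputs)
    print(buffers[0])  # every rank ends with the same reduced buffer
    print(sends)       # each entry equals 2 * (n - 1)
```

In this sketch every element is transferred n-1 times while being summed and n-1 more times while being distributed, which matches the 2(n-1) figure from PERFORMANCE.md.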