In https://github.com/NVIDIA/nccl-tests/blob/master/doc/PERFORMANCE.md, we note that "we need n-1 additions and n assignments for each element. Since every step is on a different rank except potentially one (the last input and the first output), we need 2(n-1) data transfers (x number of elements) to perform an allReduce operation."
Why are there 2(n-1) data transfers? The n-1 additions cause n-1 data transfers, and the n assignments also cause n-1 data transfers: one of the n assignments happens within a single GPU, and if we do the allreduce in-place, that assignment doesn't even exist. The ring allreduce algorithm likewise requires 2(n-1) = 2n-2 data transfers per element.
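To make the counting concrete, here is a toy simulation of ring allreduce (a hypothetical model, not NCCL's actual implementation): n ranks, data split into n single-element chunks, with n-1 reduce-scatter steps followed by n-1 allgather steps. Counting chunk transfers and dividing by the n chunks gives exactly 2(n-1) transfers per element.

```python
def ring_allreduce(data):
    """Toy ring allreduce over per-rank vectors of length n.

    Returns (final per-rank vectors, total chunk transfers).
    Transfers per element = total transfers / n = 2(n-1).
    """
    n = len(data)
    chunks = [list(v) for v in data]  # chunks[r][c] = rank r's value for chunk c
    transfers = 0

    # Phase 1: reduce-scatter. At step s, rank r sends chunk (r - s) mod n
    # to rank (r+1) mod n, which adds it to its own copy. After n-1 steps,
    # rank r holds the fully reduced chunk (r+1) mod n.
    for s in range(n - 1):
        new = [row[:] for row in chunks]
        for r in range(n):
            c = (r - s) % n
            dst = (r + 1) % n
            new[dst][c] = chunks[dst][c] + chunks[r][c]
            transfers += 1
        chunks = new

    # Phase 2: allgather. At step s, rank r forwards the reduced chunk
    # (r + 1 - s) mod n to rank (r+1) mod n, which simply assigns it
    # (no addition needed). Another n-1 steps.
    for s in range(n - 1):
        new = [row[:] for row in chunks]
        for r in range(n):
            c = (r + 1 - s) % n
            dst = (r + 1) % n
            new[dst][c] = chunks[r][c]
            transfers += 1
        chunks = new

    return chunks, transfers


result, transfers = ring_allreduce([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(result)          # every rank holds [12, 15, 18]
print(transfers // 3)  # 4 = 2*(3-1) transfers per element
```

Note that the final assignment of each chunk into a rank's own output buffer is local and costs no transfer, which matches the "except potentially one (the last input and the first output)" caveat in the quoted text.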