NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

Why "Enable LL128 by default only on Volta/Ampere/Hopper+NVLink"? #786

Open mtxuhao opened 1 year ago

mtxuhao commented 1 year ago

Hi NCCL team,

Why "Enable LL128 by default only on Volta/Ampere/Hopper+NVLink"? What is the root reason? Thanks. https://github.com/NVIDIA/nccl/blob/f3d51667838f7542df8ea32ea4e144d812b3ed7c/src/graph/tuning.cc#L229

sjeaugey commented 1 year ago

Hi, that is because LL128 relies on the assumption that a 128B store will reach the other GPU in ascending address order, and that assumption is quite fragile.

Therefore we only enable it on platforms where we have verified that every link in the chain provides that guarantee. We are being conservative, as we don't want our users to experience silent data corruption.

If you're brave, you can enable it on non-supported platforms with NCCL_PROTO=LL,LL128,SIMPLE. No guarantees it won't hurt you one day though...
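As a minimal sketch of the override above, assuming the NCCL application is launched as a child process: the variable has to be in the environment before NCCL initializes, so one option is to build the environment first and pass it to the launcher. The helper name `nccl_env_with_ll128` and the `train.py` script in the comment are hypothetical, used only for illustration.

```python
import os


def nccl_env_with_ll128(base_env=None):
    """Return a copy of the environment with NCCL_PROTO allowing LL128.

    Hypothetical helper: it only prepares environment variables and does
    not talk to NCCL itself.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Value taken verbatim from the comment above. On unverified
    # platforms this trades safety for speed.
    env["NCCL_PROTO"] = "LL,LL128,SIMPLE"
    return env


env = nccl_env_with_ll128({"PATH": "/usr/bin"})
print(env["NCCL_PROTO"])  # prints "LL,LL128,SIMPLE"

# Possible usage (train.py is a placeholder for your own NCCL program):
#   import subprocess
#   subprocess.run(["python", "train.py"], env=nccl_env_with_ll128(), check=True)
```

Setting the variable per launch, rather than globally in a shell profile, keeps the risky setting limited to the jobs where you have decided to accept it.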

mtxuhao commented 1 year ago

I'm confused by "all the chain was giving that guarantee". What is the chain? Thanks.

sjeaugey commented 1 year ago

Sorry, that was unclear. For GPUs on the same node, that means the path between the two GPU SMs: GPU memory system, NVLink, and NVSwitch. For GPUs on different nodes, that means the GPU PCI interface, the PCI switches, the NICs, and the fabric. At each step we need to make sure the 128 bytes won't be split and then reordered, which would cause us to see the flag at the end updated while the data before it has not been updated yet.
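The failure mode described above can be illustrated with a toy simulation. This is not NCCL code: it assumes a 128-byte line modeled as 16 eight-byte words with the flag in the last word, and an imaginary link that splits the store into four 32-byte chunks (the chunk size is an assumption chosen for illustration). The receiver polls the flag after each chunk lands and checks whether the data before it is complete.

```python
LINE_WORDS = 16               # 16 x 8-byte words = 128 bytes
FLAG_INDEX = LINE_WORDS - 1   # flag sits at the end of the line


def deliver(dest, line, chunk_order, chunk_words=4):
    """Copy `line` into `dest` one chunk at a time, in `chunk_order`.

    Yields after each chunk so the caller can observe partial states.
    """
    for c in chunk_order:
        start = c * chunk_words
        dest[start:start + chunk_words] = line[start:start + chunk_words]
        yield


def receiver_sees_stale_data(chunk_order, flag=1):
    """Return True if the flag is ever visible before all data has landed."""
    payload = list(range(100, 100 + FLAG_INDEX))  # 15 data words
    line = payload + [flag]
    dest = [0] * LINE_WORDS                       # receive buffer, flag cleared
    for _ in deliver(dest, line, chunk_order):
        if dest[FLAG_INDEX] == flag and dest[:FLAG_INDEX] != payload:
            return True   # receiver would consume partly stale data
    return False


print(receiver_sees_stale_data([0, 1, 2, 3]))  # ascending order: False
print(receiver_sees_stale_data([3, 0, 1, 2]))  # flag chunk first: True
```

With ascending delivery the flag is the last thing to arrive, so seeing it implies the data is complete. Once any component on the path can deliver the final chunk early, the receiver has no way to detect the problem, which is why the corruption would be silent.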

mtxuhao commented 1 year ago

Thanks very much. Closing the issue.