mtxuhao opened this issue 1 year ago
Hi, that is because LL128 relies on the assumption that a 128-byte store will reach the other GPU in ascending address order, which is quite fragile.
Therefore we only enable it by default on platforms where we have verified that every component along the chain provides that guarantee. We are being conservative because we don't want our users to experience silent data corruption.
If you're brave, you can enable it on unsupported platforms with NCCL_PROTO=LL,LL128,SIMPLE. No guarantees it won't hurt you one day, though...
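As a sketch of that workaround, assuming you start your NCCL job from a Python wrapper script: the helper below builds a child-process environment that allows the LL, LL128 and Simple protocols. The name env_with_ll128 is hypothetical, not part of NCCL. As far as I know, NCCL reads NCCL_PROTO when it initializes, so the variable has to be in the process environment before the first communicator is created.

```python
import os


def env_with_ll128(base=None):
    """Return a copy of `base` (default: os.environ) with NCCL_PROTO set so that
    LL, LL128 and Simple are all allowed. The input mapping is not modified."""
    env = dict(os.environ if base is None else base)
    # Same value as suggested in the thread; this overrides the platform
    # default that would otherwise leave LL128 disabled.
    env["NCCL_PROTO"] = "LL,LL128,SIMPLE"
    return env
```

For example, subprocess.run(["./my_nccl_app"], env=env_with_ll128()), where ./my_nccl_app stands in for your own binary. Note that this only skips the default check; it does not make the platform honor the 128-byte ordering assumption.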
I'm confused about the "chain" you mention: what is the chain? Thanks.
Sorry, that was unclear. For GPUs on the same node, it means the path between the SMs of the two GPUs: the GPU memory system, NVLink, and NVSwitch. For GPUs on different nodes, it means the GPU PCI interface, the PCI switches, the NICs, and the fabric. At each step we need to make sure the 128 bytes won't be split and then reordered; otherwise the receiver could see the flag at the end of the line already updated while the data before it is not updated yet.
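To make that failure mode concrete, here is a toy Python model of one 128-byte line (not NCCL code). Following the explanation above, it assumes the flag sits in the last 8 bytes and the first 120 bytes carry data; the chunk sizes and delivery orders are made up to mimic a path that splits the store. The receiver trusts the data as soon as it sees the flag, which is only safe when the pieces land in ascending address order.

```python
LINE_BYTES = 128                      # one 128-byte store
FLAG_BYTES = 8                        # assumption: flag occupies the last 8 bytes
DATA_BYTES = LINE_BYTES - FLAG_BYTES  # 120 bytes of payload


def make_line(payload, flag):
    """Sender side: payload followed by the flag, issued as one 128-byte store."""
    assert len(payload) == DATA_BYTES
    return payload + flag.to_bytes(FLAG_BYTES, "little")


def deliver(dst, line, chunk, order):
    """Model the path: the store is split into `chunk`-byte pieces that land in
    `dst` in the given `order`. Yields a copy of `dst` after each piece lands."""
    pieces = [(off, line[off:off + chunk]) for off in range(0, len(line), chunk)]
    for idx in order:
        off, data = pieces[idx]
        dst[off:off + len(data)] = data
        yield bytes(dst)


def receiver_check(snapshot, expected_flag, expected_payload):
    """Receiver side: poll the flag; once it matches, the data is consumed.
    The model compares against the known payload only to label the outcome."""
    flag = int.from_bytes(snapshot[DATA_BYTES:], "little")
    if flag != expected_flag:
        return "wait"      # flag not there yet, keep polling
    if snapshot[:DATA_BYTES] == expected_payload:
        return "ok"        # flag and data both arrived
    return "corrupt"       # flag arrived before the data: stale data consumed


def simulate(chunk, order):
    payload = bytes(range(1, DATA_BYTES + 1))  # nonzero, so stale zeros show up
    line = make_line(payload, flag=1)
    dst = bytearray(LINE_BYTES)                # receiver buffer starts zeroed (flag 0)
    return [receiver_check(s, 1, payload) for s in deliver(dst, line, chunk, order)]


if __name__ == "__main__":
    print(simulate(128, [0]))    # unsplit store:           ['ok']
    print(simulate(64, [0, 1]))  # split, ascending order:  ['wait', 'ok']
    print(simulate(64, [1, 0]))  # split and reordered:     ['corrupt', 'ok']
```

Splitting alone is harmless as long as the piece holding the flag lands last; it is the split-then-reorder case that produces silent corruption, which is why every hop on the path has to be verified.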
Thanks very much, closing the issue.
Hi NCCL team,
Why is LL128 enabled by default only on Volta/Ampere/Hopper with NVLink ("Enable LL128 by default only on Volta/Ampere/Hopper+NVLink")? What is the root reason? Thanks. https://github.com/NVIDIA/nccl/blob/f3d51667838f7542df8ea32ea4e144d812b3ed7c/src/graph/tuning.cc#L229