NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

Questions about nccl Group #1028

Open · ziannchen opened this issue 1 year ago

ziannchen commented 1 year ago

If I call multiple communication operations between ncclGroupStart/ncclGroupEnd, will these operations be processed in parallel, using more SMs? Does this kind of aggregation have an upper limit? Or does NCCL just merge them into one kernel and execute them serially?
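
For reference, the kind of aggregation being asked about might look like this (a minimal sketch; `sendbuf`, `recvbuf`, `nPeers`, `comm`, and `stream` are placeholder names):

```c
// Multiple communication calls issued between ncclGroupStart/ncclGroupEnd.
// The question is whether NCCL runs these in parallel across SMs or
// serially within one kernel.
ncclGroupStart();
for (int peer = 0; peer < nPeers; peer++) {
  ncclSend(sendbuf[peer], count, ncclFloat, peer, comm, stream);
  ncclRecv(recvbuf[peer], count, ncclFloat, peer, comm, stream);
}
ncclGroupEnd();
```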

sjeaugey commented 1 year ago

Operations within a group will run as a single kernel. They will be executed in parallel, up to 8 operations per p2p channel at a time, in multiple rounds if necessary, following a schedule that avoids deadlocks.

ziannchen commented 1 year ago

Does this aggregation cost more SMs than calling them one by one in separate groups? And will these operations use different SMs, or share the same SMs at different times?

sjeaugey commented 1 year ago

If you run operations one at a time, they will consume at least one SM on each launch (possibly more). If run inside a group, we'll pack up to 8 operations per SM and run all SMs in parallel, so aggregation should cost far less in terms of SMs and kernel launches.
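
For contrast with the aggregated group shown above, the "one by one in separate groups" variant would look like this (same placeholder names); each ncclGroupEnd here triggers its own kernel launch, whereas the single large group packs up to 8 p2p operations onto each SM:

```c
/* One operation pair per group: every ncclGroupEnd() is a separate
 * kernel launch, each occupying at least one SM (possibly more). */
for (int i = 0; i < nPeers; i++) {
  ncclGroupStart();
  ncclSend(sendbuf[i], count, ncclFloat, peers[i], comm, stream);
  ncclRecv(recvbuf[i], count, ncclFloat, peers[i], comm, stream);
  ncclGroupEnd();
}
```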

ziannchen commented 1 year ago

I also found this issue: https://github.com/NVIDIA/nccl/issues/665. I can understand how it works inside an SM for sendrecv, but I still have some questions. Will coll and p2p be aggregated in the same SM? I guess https://github.com/NVIDIA/nccl/blob/0b083e52096c387bad7a5c5c65b26a9dca54de8c/src/collectives/device/common.h#L79 is the line of code that determines how different warps in an SM perform different operations, but I don't see the logic for p2p there.
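
For context, the mixed case being asked about would look like this at the API level (a sketch with placeholder buffer names; whether the collective and the p2p calls end up sharing SMs inside the resulting kernel is exactly the open question here):

```c
ncclGroupStart();
// A collective and p2p operations issued in the same group.
ncclAllReduce(arSend, arRecv, arCount, ncclFloat, ncclSum, comm, stream);
ncclSend(p2pSend, p2pCount, ncclFloat, peer, comm, stream);
ncclRecv(p2pRecv, p2pCount, ncclFloat, peer, comm, stream);
ncclGroupEnd();
```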