ziannchen opened this issue 1 year ago (status: Open)
Operations within a group are launched as a single kernel. They are executed in parallel, up to 8 operations per p2p channel at a time, in multiple rounds if necessary, following a schedule that avoids deadlocks.
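To make the "multiple rounds" part concrete, here is a toy Python model of the packing described above. It is only a sketch under stated assumptions: `plan_rounds`, `num_channels`, and the in-order, contiguous assignment of operations to channels are hypothetical choices for illustration. It is not NCCL's scheduler, and it does not model the deadlock-avoiding ordering at all.

```python
OPS_PER_CHANNEL = 8  # per-channel limit quoted in the comment above


def plan_rounds(num_ops, num_channels):
    """Split operation indices into rounds of at most 8 ops per channel.

    Hypothetical helper: ops are dealt out in submission order. Real NCCL
    picks an order that avoids deadlocks, which this toy model ignores.
    """
    if num_channels < 1:
        raise ValueError("num_channels must be at least 1")
    per_round = OPS_PER_CHANNEL * num_channels
    rounds = []
    for start in range(0, num_ops, per_round):
        batch = list(range(start, min(start + per_round, num_ops)))
        # Channel c takes a contiguous slice of up to 8 ops; a channel
        # with nothing left to do in this round gets an empty list.
        rounds.append([
            batch[c * OPS_PER_CHANNEL:(c + 1) * OPS_PER_CHANNEL]
            for c in range(num_channels)
        ])
    return rounds
```

For example, under these assumptions `plan_rounds(20, 2)` needs two rounds: the first fills both channels with 8 operations each, and the second runs the remaining 4 on one channel while the other sits idle.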
Does this aggregation cost more SMs than calling the operations one by one in separate groups? And will these operations use different SMs, or share the same SMs at different times?
If you run operations one at a time, each launch will consume at least one SM (possibly more). If they run inside a group, we'll pack up to 8 operations per SM and run all SMs in parallel, so aggregation should cost far less in terms of both SMs and kernel launches.
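As a rough back-of-the-envelope illustration of that comparison, the arithmetic can be sketched in Python. The function names and the `min_sm_slots` field are hypothetical. The numbers are lower bounds derived only from the statement above, and the model ignores the channel count and the multi-round schedule.

```python
from math import ceil

OPS_PER_SM = 8  # packing limit quoted in the comment above


def ungrouped_cost(num_ops):
    # One kernel launch per operation, each occupying at least one SM.
    return {"launches": num_ops, "min_sm_slots": num_ops}


def grouped_cost(num_ops):
    # One launch for the whole group, with up to 8 operations packed per SM.
    return {
        "launches": 1 if num_ops else 0,
        "min_sm_slots": ceil(num_ops / OPS_PER_SM),
    }
```

Under these assumptions, 16 operations issued one at a time cost 16 launches and at least 16 SM occupancies over time, while the same 16 operations in one group cost a single launch using at least 2 SMs concurrently.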
I also found this issue: https://github.com/NVIDIA/nccl/issues/665. I understand how sendrecv works inside an SM, but I still have some questions. Will collectives and p2p operations be aggregated in the same SM? I guess this line of code, https://github.com/NVIDIA/nccl/blob/0b083e52096c387bad7a5c5c65b26a9dca54de8c/src/collectives/device/common.h#L79, determines how different warps in an SM perform different operations, but I don't see any logic for P2P there.
If I call multiple communication operations between ncclGroupStart/End, will these operations be processed in parallel, using more SMs? Does this kind of aggregation have an upper limit? Or does NCCL just merge them into one kernel and execute them serially?