A question about the order of function calls in nccl/src/transport/net.cc

Hello! I have a question. Suppose a node has only one GPU, and that GPU needs to perform RecvReduceSend() or RecvCopySend(). I think NCCL performs the following steps:

Step 1: Call NetIrecv() and RecvNetTest() here to receive the data: https://github.com/NVIDIA/nccl/blob/2ea4ee94bfb04c886c79ccae60ac9961000fdee2/src/transport/net.cc#L1320

Step 2: If the return value `done` from Step 1 is true, the GPU can perform the Reduce or Copy.

Step 3: After the data has been moved from the GPU to the network buffer, call NetIsend() here to send it to the next node: https://github.com/NVIDIA/nccl/blob/2ea4ee94bfb04c886c79ccae60ac9961000fdee2/src/transport/net.cc#L1132
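To make the ordering I expect concrete, here is a minimal sketch of the three steps as a per-chunk state machine. This is not NCCL code: `Stage`, `Chunk`, and `advance` are hypothetical names I made up for illustration, and the sketch only encodes my assumption that each stage must directly follow the previous one.

```cpp
#include <cassert>

// Hypothetical stages for one chunk of data; these are NOT NCCL types.
enum class Stage {
  RecvPosted,  // Step 1: NetIrecv() has been called
  RecvDone,    // Step 1: RecvNetTest() reported done == true
  GpuDone,     // Step 2: Reduce or Copy finished, data is in the network buffer
  SendPosted   // Step 3: NetIsend() has been called
};

struct Chunk {
  Stage stage = Stage::RecvPosted;
};

// Advances the chunk only if `next` is the direct successor of the current
// stage. Returns false (and leaves the chunk unchanged) otherwise.
inline bool advance(Chunk& c, Stage next) {
  if (static_cast<int>(next) != static_cast<int>(c.stage) + 1) return false;
  c.stage = next;
  return true;
}
```

Under this assumption, trying to reach `SendPosted` straight from `RecvPosted` is rejected, which is exactly the ordering I expected to see in the trace.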

I think these three steps must be performed sequentially, which means that NetIsend() in Step 3 must be called after the corresponding RecvNetTest() in Step 1 completes. I used Nsight Systems to trace the functions described above and found that this holds for the Simple protocol. For LL/LL128, however, it does not always hold. Also, loading the RDMA module seems to cause more NetIsend() calls to occur before the corresponding RecvNetTest(). Is this behavior expected, or does it indicate a problem?
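For reference, the check I apply to the trace can be sketched roughly as follows. The `TraceEvent` record and the `step` field that pairs a receive with its send are my own hypothetical format, not something Nsight Systems or NCCL produces directly, so a wrong pairing on my side could also explain the result.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical event kinds extracted from the trace.
enum class Kind { RecvTestDone, IsendCalled };

struct TraceEvent {
  Kind kind;
  int step;              // assumed index pairing a receive with its send
  std::uint64_t timeNs;  // timestamp in nanoseconds
};

// Returns the steps whose NetIsend() timestamp precedes the first matching
// RecvNetTest() completion, or that have no matching completion at all.
std::vector<int> findEarlySends(const std::vector<TraceEvent>& events) {
  std::map<int, std::uint64_t> recvDone;
  for (const auto& e : events) {
    // emplace keeps the first completion seen for each step
    if (e.kind == Kind::RecvTestDone) recvDone.emplace(e.step, e.timeNs);
  }
  std::vector<int> early;
  for (const auto& e : events) {
    if (e.kind != Kind::IsendCalled) continue;
    auto it = recvDone.find(e.step);
    if (it == recvDone.end() || e.timeNs < it->second) early.push_back(e.step);
  }
  return early;
}
```

With Simple this kind of check returns no steps for my traces, while with LL/LL128 some steps are flagged.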

Thanks a lot for any help!
