Hello! I have the following question: Suppose we have only one GPU in a node and the GPU needs to perform `RecvReduceSend()` or `RecvCopySend()`. I think NCCL will perform the following steps:

Step 1: First call `NetIrecv()` and `RecvNetTest()` here to receive the data: https://github.com/NVIDIA/nccl/blob/2ea4ee94bfb04c886c79ccae60ac9961000fdee2/src/transport/net.cc#L1320

Step 2: If the return value `done` in the first step is true, the GPU can then perform the `Reduce` or `Copy`.

Step 3: After the data has been moved from the GPU to the network buffer, call `NetIsend()` here to send the data to the next node: https://github.com/NVIDIA/nccl/blob/2ea4ee94bfb04c886c79ccae60ac9961000fdee2/src/transport/net.cc#L1132

I think these three steps need to be performed sequentially, which means `NetIsend()` in step 3 must be called later than the corresponding `RecvNetTest()` in step 1. I used Nsight Systems to trace the functions described above and found that this holds for the Simple protocol, but not always for LL/LL128. Also, loading the RDMA module may cause more `NetIsend()` calls to happen before the corresponding `RecvNetTest()`. Is this behavior normal or abnormal?

Thanks a lot for any help!
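
To make the ordering I expect concrete, here is a minimal sketch of the three steps as a strictly sequential loop, plus a check of the property I am looking for in the trace. This is not NCCL code: `mockRecvAndTest`, `mockReduceOrCopy`, `mockIsend`, `runSequential`, `sendsFollowRecvTests`, and the event strings are all hypothetical stand-ins I made up for illustration. It only models my assumption, not how the NCCL proxy actually schedules work.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical event log; stands in for an Nsight Systems trace.
struct Trace {
    std::vector<std::string> events;
    void log(const std::string& e) { events.push_back(e); }
};

// Step 1 (assumed): post the receive, then test until it reports done.
bool mockRecvAndTest(Trace& t, int chunk) {
    t.log("irecv:" + std::to_string(chunk));
    t.log("recv_test_done:" + std::to_string(chunk));
    return true;  // models the `done` flag being set
}

// Step 2 (assumed): the GPU reduces or copies the received chunk.
void mockReduceOrCopy(Trace& t, int chunk) {
    t.log("reduce_or_copy:" + std::to_string(chunk));
}

// Step 3 (assumed): send the result to the next node.
void mockIsend(Trace& t, int chunk) {
    t.log("isend:" + std::to_string(chunk));
}

// Runs the three steps strictly in order for each chunk.
std::vector<std::string> runSequential(int numChunks) {
    Trace t;
    for (int c = 0; c < numChunks; ++c) {
        if (mockRecvAndTest(t, c)) {
            mockReduceOrCopy(t, c);
            mockIsend(t, c);
        }
    }
    return t.events;
}

// True if, for every chunk, the send is logged after the matching
// completed receive test. This is the property I expected to hold.
bool sendsFollowRecvTests(const std::vector<std::string>& events, int numChunks) {
    for (int c = 0; c < numChunks; ++c) {
        const std::string id = std::to_string(c);
        auto done = std::find(events.begin(), events.end(), "recv_test_done:" + id);
        auto send = std::find(events.begin(), events.end(), "isend:" + id);
        if (done == events.end() || send == events.end() || send < done) {
            return false;
        }
    }
    return true;
}
```

In this model `sendsFollowRecvTests(runSequential(n), n)` always holds. What I observe with LL/LL128 corresponds to a trace where that check would fail for some chunks.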