I use Nsight Systems to profile 4 processes, and each process controls one GPU. There is P2P communication between the 4 GPUs.
In the picture above, according to the NVTX ranges, the two red boxes are a pair of send/recv kernels between 2 GPUs, and the two green boxes are another send/recv pair between the same two GPUs.
What confuses me is that the send kernel and the recv kernel of one send/recv pair don't overlap with each other at all: the send kernel runs far ahead of the recv kernel. I had expected the two kernels of a send/recv pair to overlap in time.
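For context, here is a minimal sketch of the pattern I'm profiling (not my exact code; the rank pairing, tensor size, and NVTX labels are illustrative):

```python
# Minimal sketch of the profiled P2P pattern (illustrative, not the exact
# workload). Launched under Nsight Systems with something like:
#   nsys profile --trace=cuda,nvtx torchrun --nproc_per_node=4 p2p_repro.py
import os
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    buf = torch.ones(1 << 20, device="cuda")

    # First pair (the "red boxes"): rank 0 sends, rank 1 receives.
    with torch.cuda.nvtx.range("sendrecv_pair_1"):
        if rank == 0:
            dist.send(buf, dst=1)
        elif rank == 1:
            dist.recv(buf, src=0)

    # Second pair (the "green boxes"): the reverse direction.
    with torch.cuda.nvtx.range("sendrecv_pair_2"):
        if rank == 1:
            dist.send(buf, dst=0)
        elif rank == 0:
            dist.recv(buf, src=1)

    torch.cuda.synchronize()
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```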
I tried to find the root cause of this phenomenon. First, I suspected that in Nsight Systems the kernel timelines of different GPUs might not be aligned with each other, so I executed an all-reduce to check the alignment of all GPU timelines. The two blue boxes are the all-reduce kernels, and they line up across the GPU rows, which implies that misalignment of the GPU timelines is probably not the root cause.
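The alignment check was just a bare all-reduce under an NVTX range, roughly like this (again a sketch, assuming the process group was already initialized as in the code above):

```python
import torch
import torch.distributed as dist

def alignment_probe():
    # Every rank launches the same collective. An all-reduce cannot complete
    # until every rank has entered it, so if the timeline rows are aligned,
    # these kernels (the blue boxes) should line up across all GPU rows.
    probe = torch.ones(1 << 20, device="cuda")
    with torch.cuda.nvtx.range("allreduce_alignment_check"):
        dist.all_reduce(probe)
    torch.cuda.synchronize()
```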
Thus I want to know:
1. In the picture above, why does a pair of send and recv kernels not overlap with each other, with the send kernel running far ahead of the recv kernel?
Software Versions:
PyTorch 2.1.2 + CUDA 11.8
NCCL 2.21.5 (I use pybind to bind to an external NCCL library.)