NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

[Question] Why is ncclSend non-blocking? #1456

Open YanjieGao opened 1 month ago

YanjieGao commented 1 month ago

I have been running some communication-synchronization tests recently and found that send in PyTorch with the NCCL backend behaves as non-blocking: the send CUDA kernel does not appear to block execution while waiting for the peer's receive operation, whereas the recv kernel does block. This is inconsistent with the descriptions in the PyTorch and NCCL documentation. What is the reason?

[Screenshots attached]

Reproduction example (I am trying to create a test case that triggers a communication deadlock: the Gloo backend hangs at the send call, while the NCCL backend hangs at the recv call; launch commands are noted after the script):

import torch
import torch.distributed as dist
from argparse import ArgumentParser

if __name__ == "__main__":
    parser = ArgumentParser()
    parser.add_argument("--rank", type=int)
    parser.add_argument("--backend", type=str)
    args = parser.parse_args()
    rank = args.rank
    local_rank, remote_rank = args.rank, 1 - args.rank
    device = torch.device('cuda', local_rank)
    torch.cuda.set_device(device)

    if args.backend == "nccl":
        dist.init_process_group(backend="nccl", init_method='tcp://127.0.0.1:30001', rank=local_rank, world_size=2)
    else:
        dist.init_process_group(backend="gloo", init_method='tcp://127.0.0.1:30001', rank=local_rank, world_size=2)

    dist.barrier()
    print("ready")
    tensor_size = 500 * 1024 * 1024  # 500M elements (~2 GB in fp32)
    tensor_to_send = torch.ones(tensor_size).to(rank)
    tensor_to_recv = torch.ones(tensor_size).to(rank)

    # Both ranks issue send first and recv second, so a send that blocked
    # until the matching recv were posted would deadlock here.
    if rank == 0:
        print("before send")
        dist.send(tensor_to_send, remote_rank)
        print("after send")
        torch.cuda.synchronize(rank)
        print("before recv")
        dist.recv(tensor_to_recv, remote_rank)
        print("after recv")
        torch.cuda.synchronize(rank)
    if rank == 1:
        print("before send")
        dist.send(tensor_to_send, remote_rank)
        print("after send")
        torch.cuda.synchronize(rank)
        print("before recv")
        dist.recv(tensor_to_recv, remote_rank)
        print("after recv")
        torch.cuda.synchronize(rank)

    print("done")
YanjieGao commented 1 month ago

Is the design of send related to MPI's buffered-mode send operation? That would still be inconsistent with the NCCL documentation, which describes synchronous semantics for send.

[Screenshot attached]

https://www.mpi-forum.org/docs/mpi-2.2/mpi22-report/node53.htm
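
For comparison, here is a minimal sketch (not part of the original report) of the same exchange written with PyTorch's batched point-to-point API, dist.P2POp plus dist.batch_isend_irecv. As far as I understand, the NCCL backend issues these inside a single ncclGroupStart/ncclGroupEnd group, so each rank's send and recv are posted together and the send-before-recv ordering from the repro script is avoided. The tensor size and rendezvous address simply mirror the script above.

import torch
import torch.distributed as dist
from argparse import ArgumentParser

if __name__ == "__main__":
    parser = ArgumentParser()
    parser.add_argument("--rank", type=int)
    args = parser.parse_args()
    local_rank, remote_rank = args.rank, 1 - args.rank
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl", init_method='tcp://127.0.0.1:30001',
                            rank=local_rank, world_size=2)

    tensor_size = 500 * 1024 * 1024
    tensor_to_send = torch.ones(tensor_size).to(local_rank)
    tensor_to_recv = torch.ones(tensor_size).to(local_rank)

    # Post the send and the recv together; both directions can then
    # progress concurrently regardless of how send is scheduled.
    ops = [dist.P2POp(dist.isend, tensor_to_send, remote_rank),
           dist.P2POp(dist.irecv, tensor_to_recv, remote_rank)]
    for req in dist.batch_isend_irecv(ops):
        req.wait()
    torch.cuda.synchronize(local_rank)
    print("done")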