NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

[Question] Why is ncclSend non-blocking? #1456

Open YanjieGao opened 1 month ago

YanjieGao commented 1 month ago

I have been running some communication-synchronization tests recently and found that send in PyTorch with the NCCL backend behaves as non-blocking: the send CUDA kernel does not appear to block execution while waiting for the peer's receive operation, whereas the recv kernel does block. This is inconsistent with the descriptions in the PyTorch and NCCL documentation. What is the reason?

[Screenshots attached]

Reproduction example (I am trying to create a test case that triggers a communication deadlock: the Gloo backend hangs at the send call, while the NCCL backend hangs at the recv call; launch commands are noted after the script):

import torch
import torch.distributed as dist
from argparse import ArgumentParser

if __name__ == "__main__":
    parser = ArgumentParser()
    parser.add_argument("--rank", type=int)
    parser.add_argument("--backend", type=str)
    args = parser.parse_args()
    rank = args.rank
    local_rank, remote_rank = args.rank, 1 - args.rank
    device = torch.device('cuda', local_rank)
    torch.cuda.set_device(device)

    if args.backend == "nccl":
        dist.init_process_group(backend="nccl", init_method='tcp://127.0.0.1:30001', rank=local_rank, world_size=2)
    else:
        dist.init_process_group(backend="gloo", init_method='tcp://127.0.0.1:30001', rank=local_rank, world_size=2)

    dist.barrier()
    print("ready")
    tensor_size = 500 * 1024 * 1024  # 500M elements (~2 GB in fp32)
    tensor_to_send = torch.ones(tensor_size).to(rank)
    tensor_to_recv = torch.ones(tensor_size).to(rank)

    # Both ranks issue send first and recv second, so a send that blocked
    # until the matching recv were posted would deadlock here.
    if rank == 0:
        print("before send")
        dist.send(tensor_to_send, remote_rank)
        print("after send")
        torch.cuda.synchronize(rank)
        print("before recv")
        dist.recv(tensor_to_recv, remote_rank)
        print("after recv")
        torch.cuda.synchronize(rank)
    if rank == 1:
        print("before send")
        dist.send(tensor_to_send, remote_rank)
        print("after send")
        torch.cuda.synchronize(rank)
        print("before recv")
        dist.recv(tensor_to_recv, remote_rank)
        print("after recv")
        torch.cuda.synchronize(rank)

    print("done")
YanjieGao commented 1 month ago

Is the design of send related to MPI's buffered-mode send operation? That would still be inconsistent with the NCCL documentation, which describes synchronous semantics for send.

[Screenshot attached]

https://www.mpi-forum.org/docs/mpi-2.2/mpi22-report/node53.htm
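
For comparison, here is a minimal sketch (not part of the original report) of the same exchange written with PyTorch's batched point-to-point API, dist.P2POp plus dist.batch_isend_irecv. As far as I understand, the NCCL backend issues these inside a single ncclGroupStart/ncclGroupEnd group, so each rank's send and recv are posted together and the send-before-recv ordering from the repro script is avoided. The tensor size and rendezvous address simply mirror the script above.

import torch
import torch.distributed as dist
from argparse import ArgumentParser

if __name__ == "__main__":
    parser = ArgumentParser()
    parser.add_argument("--rank", type=int)
    args = parser.parse_args()
    local_rank, remote_rank = args.rank, 1 - args.rank
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl", init_method='tcp://127.0.0.1:30001',
                            rank=local_rank, world_size=2)

    tensor_size = 500 * 1024 * 1024
    tensor_to_send = torch.ones(tensor_size).to(local_rank)
    tensor_to_recv = torch.ones(tensor_size).to(local_rank)

    # Post the send and the recv together; both directions can then
    # progress concurrently regardless of how send is scheduled.
    ops = [dist.P2POp(dist.isend, tensor_to_send, remote_rank),
           dist.P2POp(dist.irecv, tensor_to_recv, remote_rank)]
    for req in dist.batch_isend_irecv(ops):
        req.wait()
    torch.cuda.synchronize(local_rank)
    print("done")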