NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

[BUG] NCCL 2.20.5 meets "Message truncated : received 1024 bytes instead of 256" error while 2.18.5 does not #1273

Open shh2000 opened 6 months ago

shh2000 commented 6 months ago

# env

# test code

```python
import torch
import torch.distributed as dist

if __name__ == "__main__":
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    # sub-group of 5 ranks spread across nodes (8 GPUs per node)
    group = dist.new_group([13, 15, 17, 19, 21])
    print(rank)
    print(world_size)

    # place the tensor on this rank's local GPU
    a = torch.tensor([1, 2]).to(rank % 8)
    print(a)
    dist.all_reduce(a, group=group, op=dist.ReduceOp.SUM)

    # dist.all_reduce(a, op=dist.ReduceOp.SUM)
    dist.barrier()

    dist.barrier()
    print(a)
    dist.destroy_process_group()
```
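The hard-coded sub-group `[13, 15, 17, 19, 21]` spans ranks on more than one node, so the script assumes a job with at least 22 ranks; here it is launched with 24 ranks (3 HGX-H100 nodes × 8 GPUs). A minimal guard for that assumption (a sketch, not part of the original repro):

```python
import torch.distributed as dist

# Sketch of a sanity check: the hard-coded sub-group only exists if the
# job was launched with enough ranks, e.g. 24 ranks for 3 HGX-H100
# nodes with 8 GPUs each.
GROUP_RANKS = [13, 15, 17, 19, 21]

def check_group_ranks() -> None:
    assert dist.is_initialized(), "call dist.init_process_group() first"
    world_size = dist.get_world_size()
    assert world_size > max(GROUP_RANKS), (
        f"sub-group needs at least {max(GROUP_RANKS) + 1} ranks, "
        f"got world_size={world_size}"
    )
```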

# result

1. env1 with ngctorch24:03 (nccl 2.20.5) failed with "Message truncated : received 1024 bytes instead of 256"
2. env1 with ngctorch23:09 (nccl 2.18.5) passed (see the version-check sketch below)
3. env2 with ngctorch24:03 passed
4. env2 with ngctorch23:09 passed
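
To double-check which NCCL release each container actually loads, a small sketch using the version PyTorch reports:

```python
import torch

# Print the NCCL version this PyTorch build reports, to confirm the
# 24.03 (2.20.5) vs 23.09 (2.18.5) container mapping.
print("torch:", torch.__version__)
print("cuda :", torch.version.cuda)
print("nccl :", ".".join(str(v) for v in torch.cuda.nccl.version()))
```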
sjeaugey commented 6 months ago

Can you provide the log with `NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=GRAPH,INIT,ENV` for the run on env1 which failed?
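
One way to make sure those variables actually reach every rank on every node (a sketch; exporting them in the launcher or job script works just as well) is to set them in the test script itself, before NCCL is initialized:

```python
import os

# Set the NCCL debug variables before init_process_group() / the first
# NCCL call, so they apply even if the launcher does not forward the
# environment to remote nodes.
os.environ.setdefault("NCCL_DEBUG", "INFO")
os.environ.setdefault("NCCL_DEBUG_SUBSYS", "GRAPH,INIT,ENV")
# Optional: write one log file per host/process instead of stdout.
# os.environ.setdefault("NCCL_DEBUG_FILE", "/tmp/nccl_debug.%h.%p.log")
```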

shh2000 commented 6 months ago

Additional information:

  1. when using 4 HGX-H100 nodes (32 GPUs in total), every NCCL version passed
  2. when using 2 or 4 HGX-H100 nodes with megatron:core_v0.6.0 to train 70B/45B/xxxB LLMs, training passed; MFU = 45%, which is reasonable for "mcore training a dense LLM on HGX-H100"
  3. each HGX-H100 node passed nccl-tests on its own (the all-reduce-perf results are reasonable)
shh2000 commented 6 months ago

> Can you provide the log with `NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=GRAPH,INIT,ENV` for the run on env1 which failed?

Sure, I'll run with your env vars and provide the logs, thanks.

shh2000 commented 6 months ago

@sjeaugey here's the log of noderank1 (ranks 8-15): noderank1_log.txt

kiskra-nvidia commented 6 months ago

We've fixed a similar-looking bug in NCCL 2.21; can you try with the latest version?

shh2000 commented 6 months ago

@kiskra-nvidia Thanks for the information. We may try ngctorch:2404 or some other way to upgrade to NCCL 2.21+.

By the way, are there any publicly disclosable details about the NCCL 2.20 bug? (Not to solve the problem itself, just out of curiosity about the technology involved.) My guess is that maybe 2.20 picked wrong MPI paths in some cases (depending on drivers + nnodes + topology)? Looking forward to your reply, thanks!

sjeaugey commented 6 months ago

Actually I'm not sure upgrading will help. The bug was a mix-up of the connect with the following barrier, and the barrier size was 8 bytes; here all your sizes are larger than 8 bytes.

The log you provided only shows one node. Could it be that your environment was not forwarded to the other node? That would also explain the crash, as the other node might have a different configuration, ending up with a mismatch and a discrepancy in the sizes we're trying to exchange.
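
A quick way to check that hypothesis (a sketch, not an existing tool) is to gather a few NCCL-related environment variables from every rank and compare them on rank 0:

```python
import os
import torch.distributed as dist

# Gather selected NCCL-related environment variables from all ranks and
# report mismatches on rank 0 (assumes the process group is initialized
# and each rank has set its local CUDA device).
VARS = ["NCCL_DEBUG", "NCCL_DEBUG_SUBSYS", "NCCL_ALGO", "NCCL_PROTO",
        "NCCL_IB_HCA", "NCCL_SOCKET_IFNAME"]

def report_env_mismatches() -> None:
    local = {v: os.environ.get(v) for v in VARS}
    gathered = [None] * dist.get_world_size()
    dist.all_gather_object(gathered, local)
    if dist.get_rank() == 0:
        for v in VARS:
            values = {g[v] for g in gathered}
            if len(values) > 1:
                print(f"env mismatch for {v}: {sorted(map(str, values))}")
```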

shh2000 commented 6 months ago

@sjeaugey Hi, my 3 nodes have the same bare-metal config (8 H100 + 4 activated HDR NICs (8 in total) + 2 CPUs + PCIe 5), with containers run from the same image (ngctorch24:03 + megatron core_v0.6.0). If your guess is right, could my bug be reproduced by testing P2P between every pair of ranks (C_{24}^2 = 24*23/2 = 276 cases)? And if all 276 P2P communications pass, could the bug still appear when a specific set of 5 ranks does an all-reduce?
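
For reference, a rough sketch of the pairwise check described above (a hypothetical helper, not an existing tool): every pair of ranks exchanges a small tensor while the other ranks wait at a barrier, covering all 276 combinations. Note that NCCL collectives follow ring/tree paths that a plain P2P sweep does not exercise in the same way, so a clean pairwise run narrows things down but does not by itself exercise the same paths as the 5-rank all-reduce.

```python
import itertools
import torch
import torch.distributed as dist

# Pairwise P2P sweep (sketch): for every pair (src, dst), src sends a
# small tensor to dst; all ranks hit a barrier after each pair so pairs
# are exercised one at a time. With 24 ranks this is C(24, 2) = 276 pairs.
def pairwise_p2p_check() -> None:
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    device = torch.device("cuda", rank % torch.cuda.device_count())
    for src, dst in itertools.combinations(range(world_size), 2):
        t = torch.ones(16, device=device)
        if rank == src:
            dist.send(t, dst)
        elif rank == dst:
            dist.recv(t, src)
        dist.barrier()
    if rank == 0:
        print("all pairs completed")
```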