NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

NCCL error "receiving 524288 bytes instead of 65536" #1328

Closed JingyuQian closed 4 months ago

JingyuQian commented 5 months ago

Hello,

I'm running into this NCCL problem. Similar problems have been posted before (like #626), and I've tried the suggestions there, but with no solution so far.

Environment

  1. 2 nodes, NVIDIA A5000 x 8 on each node;
  2. Custom-built Docker image on both nodes, Ubuntu 20.04, NCCL 2.20.5+cuda12.4. The base image is nvidia/cuda:12.1.1-devel-ubuntu20.04, with no additional changes to NCCL;
  3. PyTorch 2.3.1+cu121

Context

I tried multi-node training with model A, and it works fine. Then I tried the same setting with model B (same repo, different config) and ran into this error.

It looks like the error starts appearing at opCount c.

I also tried NCCL_PROTO=SIMPLE, and then the program raises torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate more than 1EB memory. After looking into it: the master broadcasts a single-bool tensor of size 4 (Tensor([4], device='cuda')), but the value received by the worker's broadcast appears to be very large, and when the worker uses it to initialize a tensor on the GPU, the error is raised.
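
For context, the broadcast follows roughly this pattern (a simplified sketch, not the exact model code; broadcast_variable_length and the byte-buffer dtype are hypothetical): the size is broadcast first and the worker allocates from it, so a corrupted size broadcast turns directly into a huge allocation.

```python
import torch
import torch.distributed as dist

def broadcast_variable_length(payload=None, src=0):
    """Broadcast a byte buffer whose size is only known on the source rank."""
    rank = dist.get_rank()
    device = torch.device("cuda", torch.cuda.current_device())

    # Step 1: broadcast the payload length as a tiny int64 tensor.
    if rank == src:
        length = torch.tensor([payload.numel()], dtype=torch.int64, device=device)
    else:
        length = torch.empty(1, dtype=torch.int64, device=device)
    dist.broadcast(length, src=src)

    # Step 2: non-source ranks allocate the receive buffer from the broadcast
    # length. If the length arrives corrupted (e.g. ~1e18 instead of 4), this
    # allocation is what raises "Tried to allocate more than 1EB memory".
    if rank != src:
        payload = torch.empty(int(length.item()), dtype=torch.uint8, device=device)

    # Step 3: broadcast the actual payload.
    dist.broadcast(payload, src=src)
    return payload
```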

Rank 0 log

NCCL version 2.20.5+cuda12.4
us-gpu012:423:423 [0] NCCL INFO Broadcast: opCount 0 sendbuff 0x7fe1af9dea00 recvbuff 0x7fe1af9dea00 count 8 datatype 0 op 0 root 0 comm 0x169dbe50 [nranks=2] stream 0x168fc370
us-gpu012:423:423 [0] NCCL INFO Broadcast: opCount 1 sendbuff 0x7fe1af9dc600 recvbuff 0x7fe1af9dc600 count 116 datatype 0 op 0 root 0 comm 0x169dbe50 [nranks=2] stream 0x168fc370
us-gpu012:423:423 [0] NCCL INFO AllReduce: opCount 2 sendbuff 0x7fe1af9de800 recvbuff 0x7fe1af9de800 count 1 datatype 1 op 0 root 0 comm 0x169dbe50 [nranks=2] stream 0x168fc370
us-gpu012:423:423 [0] NCCL INFO Broadcast: opCount 3 sendbuff 0x7fe1af9e0000 recvbuff 0x7fe1af9e0000 count 8 datatype 0 op 0 root 0 comm 0x169dbe50 [nranks=2] stream 0x168fc370
us-gpu012:423:423 [0] NCCL INFO Broadcast: opCount 4 sendbuff 0x7fe1af9de800 recvbuff 0x7fe1af9de800 count 126 datatype 0 op 0 root 0 comm 0x169dbe50 [nranks=2] stream 0x168fc370
us-gpu012:423:423 [0] NCCL INFO AllReduce: opCount 5 sendbuff 0x7fe1af9dc600 recvbuff 0x7fe1af9dc600 count 1 datatype 1 op 0 root 0 comm 0x169dbe50 [nranks=2] stream 0x168fc370
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
us-gpu012:423:423 [0] NCCL INFO AllGather: opCount 6 sendbuff 0x7fe06b000000 recvbuff 0x7fe06b000600 count 8 datatype 0 op 0 root 0 comm 0x169dbe50 [nranks=2] stream 0x168fc370
us-gpu012:423:423 [0] NCCL INFO Broadcast: opCount 7 sendbuff 0x7fe06b000600 recvbuff 0x7fe06b000600 count 1328 datatype 0 op 0 root 0 comm 0x169dbe50 [nranks=2] stream 0x168fc370
us-gpu012:423:423 [0] NCCL INFO Broadcast: opCount 8 sendbuff 0x7fe1a2000000 recvbuff 0x7fe1a2000000 count 106782320 datatype 0 op 0 root 0 comm 0x169dbe50 [nranks=2] stream 0x168fc370
us-gpu012:423:423 [0] NCCL INFO Broadcast: opCount 9 sendbuff 0x7fe06b000000 recvbuff 0x7fe06b000000 count 704 datatype 0 op 0 root 0 comm 0x169dbe50 [nranks=2] stream 0x168fc370
us-gpu012:423:423 [0] NCCL INFO Broadcast: opCount a sendbuff 0x7fe06b000400 recvbuff 0x7fe06b000400 count 11200 datatype 0 op 0 root 0 comm 0x169dbe50 [nranks=2] stream 0x168fc370
us-gpu012:423:423 [0] NCCL INFO Broadcast: opCount b sendbuff 0x7fe1af9e0000 recvbuff 0x7fe1af9e0000 count 8 datatype 0 op 0 root 0 comm 0x169dbe50 [nranks=2] stream 0x168fc370
us-gpu012:423:423 [0] NCCL INFO Broadcast: opCount c sendbuff 0x7fe1af9dc600 recvbuff 0x7fe1af9dc600 count 4 datatype 0 op 0 root 0 comm 0x169dbe50 [nranks=2] stream 0x168fc370

Rank 1 log

us-gpu010:422:422 [0] NCCL INFO Broadcast: opCount 0 sendbuff 0x7f225a3f5a00 recvbuff 0x7f225a3f5a00 count 8 datatype 0 op 0 root 0 comm 0x15c4ddb0 [nranks=2] stream 0xd4586a0
us-gpu010:422:422 [0] NCCL INFO Broadcast: opCount 1 sendbuff 0x7f225a3f7c00 recvbuff 0x7f225a3f7c00 count 116 datatype 0 op 0 root 0 comm 0x15c4ddb0 [nranks=2] stream 0xd4586a0
us-gpu010:422:422 [0] NCCL INFO AllReduce: opCount 2 sendbuff 0x7f225a3f7c00 recvbuff 0x7f225a3f7c00 count 1 datatype 1 op 0 root 0 comm 0x15c4ddb0 [nranks=2] stream 0xd4586a0
us-gpu010:422:422 [0] NCCL INFO Broadcast: opCount 3 sendbuff 0x7f225a3f7c00 recvbuff 0x7f225a3f7c00 count 8 datatype 0 op 0 root 0 comm 0x15c4ddb0 [nranks=2] stream 0xd4586a0
us-gpu010:422:422 [0] NCCL INFO Broadcast: opCount 4 sendbuff 0x7f225a3f7e00 recvbuff 0x7f225a3f7e00 count 126 datatype 0 op 0 root 0 comm 0x15c4ddb0 [nranks=2] stream 0xd4586a0
us-gpu010:422:422 [0] NCCL INFO AllReduce: opCount 5 sendbuff 0x7f225a3f5a00 recvbuff 0x7f225a3f5a00 count 1 datatype 1 op 0 root 0 comm 0x15c4ddb0 [nranks=2] stream 0xd4586a0
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
us-gpu010:422:422 [0] NCCL INFO AllGather: opCount 6 sendbuff 0x7f211f000000 recvbuff 0x7f211f000600 count 8 datatype 0 op 0 root 0 comm 0x15c4ddb0 [nranks=2] stream 0xd4586a0
us-gpu010:422:422 [0] NCCL INFO Broadcast: opCount 7 sendbuff 0x7f211f000600 recvbuff 0x7f211f000600 count 1328 datatype 0 op 0 root 0 comm 0x15c4ddb0 [nranks=2] stream 0xd4586a0
us-gpu010:422:422 [0] NCCL INFO Broadcast: opCount 8 sendbuff 0x7f224e000000 recvbuff 0x7f224e000000 count 98263664 datatype 0 op 0 root 0 comm 0x15c4ddb0 [nranks=2] stream 0xd4586a0
us-gpu010:422:422 [0] NCCL INFO Broadcast: opCount 9 sendbuff 0x7f211f000000 recvbuff 0x7f211f000000 count 584 datatype 0 op 0 root 0 comm 0x15c4ddb0 [nranks=2] stream 0xd4586a0
us-gpu010:422:422 [0] NCCL INFO Broadcast: opCount a sendbuff 0x7f211f000400 recvbuff 0x7f211f000400 count 11200 datatype 0 op 0 root 0 comm 0x15c4ddb0 [nranks=2] stream 0xd4586a0
us-gpu010:422:422 [0] NCCL INFO Broadcast: opCount b sendbuff 0x7f225a3f5a00 recvbuff 0x7f225a3f5a00 count 8 datatype 0 op 0 root 0 comm 0x15c4ddb0 [nranks=2] stream 0xd4586a0
us-gpu010:422:776 [0] transport/net_socket.cc:488 NCCL WARN NET/Socket : peer 10.40.11.68<50802> message truncated : receiving 524288 bytes instead of 65536. If you believe your socket network is in healthy state, there may be a mismatch in collective sizes or environment settings (e.g. NCCL_PROTO, NCCL_ALGO) between ranks
us-gpu010:422:776 [0] NCCL INFO transport/net.cc:1298 -> 5
us-gpu010:422:776 [0] NCCL INFO proxy.cc:694 -> 5
us-gpu010:422:776 [0] NCCL INFO proxy.cc:874 -> 5 [Progress Thread]

us-gpu010:422:776 [0] transport/net_socket.cc:488 NCCL WARN NET/Socket : peer 10.40.11.68<50802> message truncated : receiving 1031166432 bytes instead of 65536. If you believe your socket network is in healthy state, there may be a mismatch in collective sizes or environment settings (e.g. NCCL_PROTO, NCCL_ALGO) between ranks
us-gpu010:422:776 [0] NCCL INFO transport/net.cc:1298 -> 5
us-gpu010:422:776 [0] NCCL INFO proxy.cc:694 -> 5
us-gpu010:422:776 [0] NCCL INFO proxy.cc:874 -> 5 [Progress Thread]

us-gpu010:422:776 [0] transport/net_socket.cc:488 NCCL WARN NET/Socket : peer 10.40.11.68<50802> message truncated : receiving 1012892419 bytes instead of 65536. If you believe your socket network is in healthy state, there may be a mismatch in collective sizes or environment settings (e.g. NCCL_PROTO, NCCL_ALGO) between ranks

Any insight would be helpful. Thank you.

sjeaugey commented 5 months ago

Looks like a version or configuration mismatch between the two ranks. I'd advise trying again with 2.22 first, then checking the logs for environment variables set on one rank but not the other.

If that doesn't turn up anything, you can post the full log here and we can take a look.
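
For instance, a quick per-rank check along these lines (a rough sketch, assuming the job is launched with torchrun and uses the PyTorch NCCL backend) makes it easy to spot an NCCL version or NCCL_* variable that differs between ranks:

```python
import os
import torch
import torch.distributed as dist

# Bind each process to its local GPU, then join the process group.
torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", 0)))
dist.init_process_group(backend="nccl")

# Print the NCCL version PyTorch was built with and every NCCL_* environment
# variable visible to this rank, so the outputs can be diffed across nodes.
nccl_env = {k: v for k, v in sorted(os.environ.items()) if k.startswith("NCCL_")}
print(f"rank {dist.get_rank()}: NCCL {torch.cuda.nccl.version()}, env {nccl_env}",
      flush=True)

dist.destroy_process_group()
```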

JingyuQian commented 4 months ago

Probably something in my code; I haven't run into it since. I'll close this for now.