NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

NCCL WARN NET/Socket : message truncated in PyTorch multiple machines and multiple GPUs #1203

Open ratikapoor opened 6 months ago

ratikapoor commented 6 months ago

Training is getting stuck after epoch 1. I have set NCCL_DEBUG_SUBSYS=COLL, NCCL_DEBUG=INFO, and NCCL_PROTO=Simple.
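For reference, a minimal sketch of how these variables can be set from the training script before the process group is created; everything besides the three variable names is an assumption (e.g. a torchrun-style launch), not the reporter's actual setup:

```python
# Minimal sketch (assumed setup, not the reporter's script): the NCCL variables must be
# in every rank's environment before the NCCL communicator is created.
import os

os.environ["NCCL_DEBUG"] = "INFO"
os.environ["NCCL_DEBUG_SUBSYS"] = "COLL"
os.environ["NCCL_PROTO"] = "Simple"

import torch.distributed as dist

# Assumes RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT are provided by the launcher
# (e.g. torchrun) so the default env:// initialization can be used.
dist.init_process_group(backend="nccl")
```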

Below is the log that got generated:

5dc717e6572448e4a7a20d95a57964b8000005:40:40 [3] NCCL INFO AllReduce: opCount 12c8 sendbuff 0x1544dbdfec00 recvbuff 0x1544dbdfec00 count 1 datatype 7 op 0 root 0 comm 0x44b24930 [nranks=8] stream 0x91b5610
5dc717e6572448e4a7a20d95a57964b8000005:38:38 [1] NCCL INFO AllReduce: opCount 12c7 sendbuff 0x15032b9fb400 recvbuff 0x15032b9fb400 count 1 datatype 7 op 0 root 0 comm 0x4364eff0 [nranks=8] stream 0x94cc960
5dc717e6572448e4a7a20d95a57964b8000005:38:38 [1] NCCL INFO AllReduce: opCount 12c8 sendbuff 0x15032b9fba00 recvbuff 0x15032b9fba00 count 1 datatype 7 op 0 root 0 comm 0x4364eff0 [nranks=8] stream 0x94cc960
5dc717e6572448e4a7a20d95a57964b8000005:39:39 [2] NCCL INFO AllReduce: opCount 12c7 sendbuff 0x15081bdfe200 recvbuff 0x15081bdfe200 count 1 datatype 7 op 0 root 0 comm 0x42c4bf00 [nranks=8] stream 0x89e0f90
5dc717e6572448e4a7a20d95a57964b8000005:39:39 [2] NCCL INFO AllReduce: opCount 12c8 sendbuff 0x15081bdff200 recvbuff 0x15081bdff200 count 1 datatype 7 op 0 root 0 comm 0x42c4bf00 [nranks=8] stream 0x89e0f90
5dc717e6572448e4a7a20d95a57964b8000005:37:37 [0] NCCL INFO AllReduce: opCount 12c7 sendbuff 0x152102ffe600 recvbuff 0x152102ffe600 count 1 datatype 7 op 0 root 0 comm 0x451b0890 [nranks=8] stream 0x45965060
5dc717e6572448e4a7a20d95a57964b8000005:37:37 [0] NCCL INFO AllReduce: opCount 12c8 sendbuff 0x15210d5f3e00 recvbuff 0x15210d5f3e00 count 1 datatype 7 op 0 root 0 comm 0x451b0890 [nranks=8] stream 0x45965060
5dc717e6572448e4a7a20d95a57964b8000005:40:40 [3] NCCL INFO AllGather: opCount 12c9 sendbuff 0x1544db9ffe00 recvbuff 0x15453d7ffe00 count 4 datatype 0 op 0 root 0 comm 0x44b24930 [nranks=8] stream 0x91b5610
5dc717e6572448e4a7a20d95a57964b8000005:39:39 [2] NCCL INFO AllGather: opCount 12c9 sendbuff 0x15081bdff200 recvbuff 0x150906bed400 count 4 datatype 0 op 0 root 0 comm 0x42c4bf00 [nranks=8] stream 0x89e0f90
5dc717e6572448e4a7a20d95a57964b8000005:38:38 [1] NCCL INFO AllGather: opCount 12c9 sendbuff 0x15032b9fba00 recvbuff 0x15032bdffe00 count 4 datatype 0 op 0 root 0 comm 0x4364eff0 [nranks=8] stream 0x94cc960
5dc717e6572448e4a7a20d95a57964b8000005:37:37 [0] NCCL INFO AllGather: opCount 12c9 sendbuff 0x15210d5f3e00 recvbuff 0x1522055faa00 count 4 datatype 0 op 0 root 0 comm 0x451b0890 [nranks=8] stream 0x45965060

spotluri commented 5 months ago

@ratikapoor please review previous reports of similar issues:

https://github.com/NVIDIA/nccl/issues/626
https://github.com/NVIDIA/nccl/issues/268

As suggested there, confirm from the logs that NCCL_PROTO=Simple is being propagated to all ranks, and confirm that every rank is calling each collective with the same size.
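A minimal sketch of how both conditions could be checked from inside the training script, assuming torch.distributed is already initialized with the NCCL backend; the helper names and usage below are illustrative assumptions, not something from this thread:

```python
# Hedged diagnostic sketch: verify that NCCL_PROTO reached every rank, and that a given
# tensor has the same number of elements on every rank before a collective is issued.
# Assumes dist.init_process_group(backend="nccl") has already run.
import os
import torch.distributed as dist

def check_env_propagation(var="NCCL_PROTO"):
    # Gather each rank's value of the variable and print them from rank 0.
    values = [None] * dist.get_world_size()
    dist.all_gather_object(values, os.environ.get(var))
    if dist.get_rank() == 0:
        print(f"{var} per rank:", values)

def check_collective_size(tensor, tag=""):
    # Gather each rank's element count and fail loudly on a mismatch, instead of
    # letting NCCL report a truncated message on the receiving side.
    sizes = [None] * dist.get_world_size()
    dist.all_gather_object(sizes, tensor.numel())
    if len(set(sizes)) != 1:
        raise RuntimeError(
            f"[rank {dist.get_rank()}] {tag}: mismatched element counts across ranks: {sizes}"
        )

# Example usage before a suspect collective (tensor name is a placeholder):
#   check_collective_size(grad_buffer, tag="grad all_reduce")
#   dist.all_reduce(grad_buffer)
```

Running the size check right before the collective that hangs or truncates helps identify which rank diverged, for example a rank that skipped a batch or built a different set of parameters.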