NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

NCCL Error on Multi-Node Training with Mixed GPU Setup #1366

Closed · asdfry closed this issue 1 month ago

asdfry commented 1 month ago

Description

Hello, I am testing multi-node training across three servers, each equipped with different GPUs (H100 ×8, A40 ×4, L40S ×4). During this process I encountered an NCCL error, and I would appreciate assistance in resolving it.

Environment

- NCCL version: 2.20.5
- PyTorch version: 2.3.1
- CUDA version: 12.3
- Operating System: Ubuntu 22.04.3 LTS
- Driver version: 535.161.08
- Network Configuration: Kubernetes CNI

Commands Executed

Server 1: torchrun --nnodes=3 --nproc_per_node=8 --node_rank=0 --master_addr=10.244.2.207 --master_port=1040 train.py -b 1 -dn tldr_news -m bloom-560m
Server 2: torchrun --nnodes=3 --nproc_per_node=4 --node_rank=1 --master_addr=10.244.2.207 --master_port=1040 train.py -b 1 -dn tldr_news -m bloom-560m
Server 3: torchrun --nnodes=3 --nproc_per_node=4 --node_rank=2 --master_addr=10.244.2.207 --master_port=1040 train.py -b 1 -dn tldr_news -m bloom-560m

Error Logs

Server 1:

[rank1]: Traceback (most recent call last):
[rank1]:   File "/root/train.py", line 172, in <module>
[rank1]:     model, optimizer, train_dataloader = accelerator.prepare(model, optimizer, train_dataloader)
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 1213, in prepare
[rank1]:     result = tuple(
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 1214, in <genexpr>
[rank1]:     self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 1094, in _prepare_one
[rank1]:     return self.prepare_model(obj, device_placement=device_placement)                                                            
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 1349, in prepare_model                          
[rank1]:     model = torch.nn.parallel.DistributedDataParallel(
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 798, in __init__                         
[rank1]:     _verify_param_shape_across_processes(self.process_group, parameters)                                                         
[rank1]:   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/utils.py", line 269, in _verify_param_shape_across_processes   
[rank1]:     return dist._verify_params_across_processes(process_group, tensors, logger)                                                  
[rank1]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1970, internal error - please report this issue to the NCCL developers, NCCL version 2.20.5
[rank1]: ncclInternalError: Internal check failed.                                                                                        
[rank1]: Last error:                                                 
[rank1]: ncclSocketInit: connecting to address  with family 1 is neither AF_INET(2) nor AF_INET6(10)

Server 2, 3:

[rank15]: Traceback (most recent call last):
[rank15]:   File "/root/train.py", line 172, in <module>
[rank15]:     model, optimizer, train_dataloader = accelerator.prepare(model, optimizer, train_dataloader)
[rank15]:   File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 1213, in prepare
[rank15]:     result = tuple(
[rank15]:   File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 1214, in <genexpr>
[rank15]:     self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
[rank15]:   File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 1094, in _prepare_one
[rank15]:     return self.prepare_model(obj, device_placement=device_placement)
[rank15]:   File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 1349, in prepare_model
[rank15]:     model = torch.nn.parallel.DistributedDataParallel(
[rank15]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/parallel/distributed.py", line 798, in __init__
[rank15]:     _verify_param_shape_across_processes(self.process_group, parameters)
[rank15]:   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/utils.py", line 269, in _verify_param_shape_across_processes
[rank15]:     return dist._verify_params_across_processes(process_group, tensors, logger)
[rank15]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1970, internal error - please report this issue to the NCCL developers, NCCL version 2.20.5
[rank15]: ncclInternalError: Internal check failed.
[rank15]: Last error:
[rank15]: Message truncated : received 2048 bytes instead of 256
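
For reference, the failure happens inside accelerator.prepare(). A simplified sketch of that path in train.py is below; the model/optimizer/dataloader construction is a placeholder, and only the prepare() call corresponds to line 172 in the traceback:

```python
# Simplified sketch of the failing path in train.py.
# Everything except the prepare() call is a placeholder.
import torch
from accelerate import Accelerator

accelerator = Accelerator()

model = torch.nn.Linear(1024, 1024)                  # stands in for bloom-560m
optimizer = torch.optim.AdamW(model.parameters())
train_dataloader = torch.utils.data.DataLoader(list(range(16)), batch_size=1)

# accelerator.prepare() wraps the model in torch.nn.parallel.DistributedDataParallel,
# whose constructor calls _verify_param_shape_across_processes(), an NCCL
# collective across all 16 ranks. That is where the ncclInternalError is raised.
model, optimizer, train_dataloader = accelerator.prepare(model, optimizer, train_dataloader)
```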

Additional Context

Even after changing the NCCL version and the PyTorch version, the same error occurred. I would greatly appreciate any guidance or suggestions to resolve this issue. Thank you.

kiskra-nvidia commented 1 month ago

Could you try with a newer NCCL version (preferably the latest 2.22.3)? There was a packet reordering issue during bootstrap that we've fixed in 2.21, which would result in Message truncated errors like what you observe on Servers 2 and 3.

If that doesn't help, then we'll need to see the NCCL output from all three servers obtained when running NCCL with NCCL_DEBUG=INFO.
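
For example, on Server 1 (and analogously on the other two nodes) that would be:

NCCL_DEBUG=INFO torchrun --nnodes=3 --nproc_per_node=8 --node_rank=0 --master_addr=10.244.2.207 --master_port=1040 train.py -b 1 -dn tldr_news -m bloom-560m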

asdfry commented 1 month ago

I solved it by adding NCCL_NVLS_ENABLE=0 to the launch command, as shown below for Server 1. Thank you for your response.

Server 1: NCCL_NVLS_ENABLE=0 torchrun --nnodes=3 --nproc_per_node=8 --node_rank=0 --master_addr=10.244.2.207 --master_port=1040 train.py -b 1 -dn tldr_news -m bloom-560m
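
For anyone hitting the same issue: NCCL_NVLS_ENABLE=0 turns off NCCL's NVLink SHARP (NVLS) path, which of the three nodes only the H100 server could be using here. An equivalent way to apply it, sketched under the assumption that it is set before any NCCL communicator is created in train.py:

```python
# Hedged alternative to prefixing the launch command: disable NVLS from inside
# train.py. NCCL reads NCCL_NVLS_ENABLE when the communicator is initialized,
# so the assignment must happen before accelerator.prepare() (or any
# init_process_group call). The surrounding train.py structure is assumed.
import os

os.environ["NCCL_NVLS_ENABLE"] = "0"  # disable the NVLink SHARP (NVLS) algorithm

from accelerate import Accelerator

accelerator = Accelerator()
# ... build the model/optimizer/dataloader, then call accelerator.prepare(...) as before
```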