When running distributed training, I encounter the following error:
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1275, internal error, NCCL version 2.14.3 ncclInternalError: Internal check failed. Last error: Net : Connection closed by remote peer user<50260>

Environment Details
PyTorch Version: 2.0.1+cu118
CUDA Version: 12.1
NCCL Version: 2.20.5
Python Packages: A full list of installed packages is included below (or linked) for reference.
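For context, train.py is a standard Lightning entry point. Simplified down to the part that matters here, the setup looks roughly like the sketch below; the model and data are dummies I made up for illustration, while devices=2 and strategy="ddp" are inferred from the traceback and the NCCL log (nranks 2), not copied from the real R2GenGPT code.

import torch
from torch.utils.data import DataLoader, TensorDataset
import lightning.pytorch as pl

class DummyModel(pl.LightningModule):
    # Stand-in for the real LightningModule; only here to make the sketch runnable.
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(8, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)

def main():
    dataset = TensorDataset(torch.randn(64, 8), torch.randn(64, 1))
    trainer = pl.Trainer(
        accelerator="gpu",
        devices=2,        # two ranks, matching "nranks 2" in the NCCL log below
        strategy="ddp",   # DDP over NCCL, matching strategies/ddp.py in the traceback
        max_epochs=1,
    )
    # The crash happens at the very start of fit(): Lightning sets up the profiler
    # and broadcasts the log dir from rank 0 via NCCL (broadcast_object_list),
    # which is exactly the call that fails in the traceback below.
    trainer.fit(DummyModel(), DataLoader(dataset, batch_size=8))

if __name__ == "__main__":
    main()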
I set export NCCL_DEBUG=INFO and get:
GroupNCCL.cpp:1275, internal error, NCCL version 2.14.3
ncclInternalError: Internal check failed.
Last error:
Net : Connection closed by remote peer user<43236>
Traceback (most recent call last):
File "/data108/user_hzx/SSW/R2GenGPT/train.py", line 55, in <module>
main()
File "/data108/user_hzx/SSW/R2GenGPT/train.py", line 51, in main
train(args)
File "/data108/user_hzx/SSW/R2GenGPT/train.py", line 44, in train
trainer.fit(model, datamodule=dm)
File "/data108/user_hzx/anaconda3/envs/r2gen/lib/python3.9/site-packages/lightning/pytorch/trainer/trainer.py", line 529, in fit
call._call_and_handle_interrupt(
File "/data108/user_hzx/anaconda3/envs/r2gen/lib/python3.9/site-packages/lightning/pytorch/trainer/call.py", line 41, in _call_and_handle_interrupt
return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
File "/data108/user_hzx/anaconda3/envs/r2gen/lib/python3.9/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 91, in launch
return function(*args, **kwargs)
File "/data108/user_hzx/anaconda3/envs/r2gen/lib/python3.9/site-packages/lightning/pytorch/trainer/trainer.py", line 568, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "/data108/user_hzx/anaconda3/envs/r2gen/lib/python3.9/site-packages/lightning/pytorch/trainer/trainer.py", line 932, in _run
self.__setup_profiler()
File "/data108/user_hzx/anaconda3/envs/r2gen/lib/python3.9/site-packages/lightning/pytorch/trainer/trainer.py", line 1062, in __setup_profiler
self.profiler.setup(stage=self.state.fn, local_rank=local_rank, log_dir=self.log_dir)
File "/data108/user_hzx/anaconda3/envs/r2gen/lib/python3.9/site-packages/lightning/pytorch/trainer/trainer.py", line 1182, in log_dir
dirpath = self.strategy.broadcast(dirpath)
File "/data108/user_hzx/anaconda3/envs/r2gen/lib/python3.9/site-packages/lightning/pytorch/strategies/ddp.py", line 291, in broadcast
torch.distributed.broadcast_object_list(obj, src, group=_group.WORLD)
File "/data108/user_hzx/anaconda3/envs/r2gen/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1451, in wrapper
return func(*args, **kwargs)
File "/data108/user_hzx/anaconda3/envs/r2gen/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 2255, in broadcast_object_list
broadcast(object_sizes_tensor, src=src, group=group)
File "/data108/user_hzx/anaconda3/envs/r2gen/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1451, in wrapper
return func(*args, **kwargs)
File "/data108/user_hzx/anaconda3/envs/r2gen/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1566, in broadcast
work = default_pg.broadcast([tensor], opts)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1275, internal error, NCCL version 2.14.3
ncclInternalError: Internal check failed.
Last error:
Net : Connection closed by remote peer user<55170>
user:361377:361377 [1] NCCL INFO cudaDriverVersion 11040
user:361377:361377 [1] NCCL INFO Bootstrap : Using eno1:101.6.68.46<0>
user:361377:361377 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
user:361377:361546 [1] NCCL INFO NET/IB : No device found.
user:361377:361546 [1] NCCL INFO NET/Socket : Using [0]eno1:101.6.68.46<0> [1]vethbd235fb:fe80::a885:fcff:fe93:3eac%vethbd235fb<0>
user:361377:361546 [1] NCCL INFO Using network Socket
user:361377:361546 [1] NCCL INFO NCCL_P2P_LEVEL set by environment to LOC
user:361377:361546 [1] NCCL INFO Setting affinity for GPU 5 to ffffffff,00000000,ffffffff,00000000
user:361377:361546 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
user:361377:361546 [1] NCCL INFO Channel 00 : 1[d5000] -> 0[4f000] via SHM/direct/direct
user:361377:361546 [1] NCCL INFO Channel 01 : 1[d5000] -> 0[4f000] via SHM/direct/direct
user:361377:361546 [1] misc/socket.cc:538 NCCL WARN Net : Connection closed by remote peer user<43236>
user:361377:361546 [1] NCCL INFO misc/socket.cc:546 -> 6
user:361377:361546 [1] NCCL INFO misc/socket.cc:558 -> 6
user:361377:361546 [1] NCCL INFO bootstrap.cc:66 -> 6
user:361377:361546 [1] NCCL INFO bootstrap.cc:424 -> 6
user:361377:361546 [1] NCCL INFO transport.cc:108 -> 6
user:361377:361546 [1] NCCL INFO init.cc:790 -> 6
user:361377:361546 [1] NCCL INFO init.cc:1089 -> 6
user:361377:361546 [1] NCCL INFO group.cc:64 -> 6 [Async thread]
user:361377:361377 [1] NCCL INFO group.cc:421 -> 3
user:361377:361377 [1] NCCL INFO group.cc:106 -> 3
user:361377:361377 [1] NCCL INFO comm 0x563d01a5d8e0 rank 1 nranks 2
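In case it helps narrow things down, the failing call can be reduced to a plain torch.distributed broadcast over NCCL, with no Lightning involved. The script below is only a sketch (the file name repro.py, the torchrun launch line, and the broadcast string are made up), but it exercises the same broadcast_object_list path that fails in the traceback above:

# Launch with something like: torchrun --nproc_per_node=2 repro.py
import os
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")        # same backend Lightning's DDP uses
    local_rank = int(os.environ["LOCAL_RANK"])     # set by torchrun
    torch.cuda.set_device(local_rank)
    # Lightning broadcasts the log dir string from rank 0; this mimics that step.
    obj = ["some_log_dir"] if dist.get_rank() == 0 else [None]
    dist.broadcast_object_list(obj, src=0)
    print(f"rank {dist.get_rank()} received: {obj[0]}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()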
How can I solve this?