NVIDIA / nccl

Optimized primitives for collective multi-GPU communication

torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1691, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.19.3 ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. Last error: #1381

Open 1moye opened 4 months ago

1moye commented 4 months ago

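I am running multi-node, multi-GPU training with LLaMA-Factory. So far I have checked the following on the two environments:

1. The installed packages reported by pip list are identical.
2. The machines can ping each other, and the ports are mapped to the host.
3. nvcc --version matches on both: CUDA 11.8.
4. The datasets are identical.
5. The firewall is disabled.
6. The system NCCL library was 2.16.2 on both sides, while PyTorch reports NCCL 2.19.3. I have since upgraded the NCCL library to 2.19.3 and verified it with dpkg -l | grep nccl: it is 2.19.3+cuda12.3. (There is no build for CUDA 11.8, only 11.0, 12.0 and 12.3; since it is backward compatible, I chose 12.3.)
7. Training runs normally on each environment on its own, with either NCCL 2.16.2 or 2.19.3.
8. The drivers on the two sides may differ: nvidia-smi reports different Driver Version values.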
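Item 2 above relies on ping, which only exercises ICMP and does not show that the mapped TCP port is reachable from the other node. A minimal sketch of a TCP-level check follows, using only the Python standard library. The function name `can_connect` is hypothetical, and the address and port mentioned afterwards are placeholders, not values taken from this report.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper for illustration; it only tests that something is
    listening and reachable, not that NCCL itself can use the path.
    """
    try:
        # create_connection resolves the host and performs the TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts
        return False
```

Run from the worker node while the master process is listening, for example `can_connect("192.0.2.10", 29500)` with the master's real address and rendezvous port substituted in. A `False` result would suggest that the port mapping or routing, rather than the package versions, deserves a closer look.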

That is the overall situation. The error report follows. How can this be resolved?

c39f5cf046aa:4533:4533 [0] NCCL INFO cudaDriverVersion 12050
c39f5cf046aa:4533:4533 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.3<0>
c39f5cf046aa:4533:4533 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
c39f5cf046aa:4533:4667 [0] NCCL INFO Failed to open libibverbs.so[.1]
c39f5cf046aa:4533:4667 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.3<0>
c39f5cf046aa:4533:4667 [0] NCCL INFO Using non-device net plugin version 0
c39f5cf046aa:4533:4667 [0] NCCL INFO Using network Socket
c39f5cf046aa:4533:4667 [0] NCCL INFO misc/socket.cc:568 -> 2
c39f5cf046aa:4534:4588 [1] NCCL INFO misc/socket.cc:568 -> 2
c39f5cf046aa:4533:4667 [0] NCCL INFO misc/socket.cc:619 -> 2
c39f5cf046aa:4534:4588 [1] NCCL INFO misc/socket.cc:619 -> 2
c39f5cf046aa:4533:4667 [0] NCCL INFO bootstrap.cc:274 -> 2
c39f5cf046aa:4534:4588 [1] NCCL INFO bootstrap.cc:274 -> 2
c39f5cf046aa:4533:4667 [0] NCCL INFO init.cc:1388 -> 2
c39f5cf046aa:4534:4588 [1] NCCL INFO init.cc:1388 -> 2
c39f5cf046aa:4533:4667 [0] NCCL INFO group.cc:64 -> 2 [Async thread]
c39f5cf046aa:4534:4588 [1] NCCL INFO group.cc:64 -> 2 [Async thread]
c39f5cf046aa:4533:4533 [0] NCCL INFO group.cc:418 -> 2
c39f5cf046aa:4534:4534 [1] NCCL INFO group.cc:418 -> 2
c39f5cf046aa:4533:4533 [0] NCCL INFO group.cc:95 -> 2
c39f5cf046aa:4534:4534 [1] NCCL INFO group.cc:95 -> 2
Traceback (most recent call last):
  File "/home/wsh/LLaMA-Factory-main/src/llamafactory/launcher.py", line 23, in <module>
    launch()
  File "/home/wsh/LLaMA-Factory-main/src/llamafactory/launcher.py", line 19, in launch
    run_exp()
  File "/home/wsh/LLaMA-Factory-main/src/llamafactory/train/tuner.py", line 48, in run_exp
    run_pt(model_args, data_args, training_args, finetuning_args, callbacks)
  File "/home/wsh/LLaMA-Factory-main/src/llamafactory/train/pt/workflow.py", line 45, in run_pt
    dataset_module = get_dataset(model_args, data_args, training_args, stage="pt", **tokenizer_module)
  File "/home/wsh/LLaMA-Factory-main/src/llamafactory/data/loader.py", line 232, in get_dataset
    with training_args.main_process_first(desc="load dataset"):
  File "/opt/conda/lib/python3.10/contextlib.py", line 135, in __enter__
    return next(self.gen)
  File "/opt/conda/lib/python3.10/site-packages/transformers/training_args.py", line 2410, in main_process_first
    dist.barrier()
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3439, in barrier
    work = default_pg.barrier(opts=opts)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1691, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.19.3
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
Last error:
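One detail in the log may be worth checking: NCCL bootstraps on eth0 with address 172.17.0.3, which falls inside 172.17.0.0/16, the default subnet of Docker's bridge network. If each node runs in a bridge-networked container (an assumption; the report does not say), that address would usually not be routable from the other host, which could be consistent with the socket.cc failures. The sketch below extracts the bootstrap interface and address from an NCCL INFO line and tests it against that subnet. The helper names are hypothetical, and the regular expression is written only against the line format shown in this report.

```python
import ipaddress
import re

# Docker's default bridge subnet. Assumption: the containers here use the
# default bridge; a custom network would use a different range.
DOCKER_DEFAULT_BRIDGE = ipaddress.ip_network("172.17.0.0/16")

# Matches lines such as:
#   ... NCCL INFO Bootstrap : Using eth0:172.17.0.3<0>
BOOTSTRAP_RE = re.compile(
    r"NCCL INFO Bootstrap : Using (\S+?):(\d+\.\d+\.\d+\.\d+)<"
)


def bootstrap_endpoint(log_text):
    """Return (interface, IPv4Address) from the first bootstrap line, or None."""
    match = BOOTSTRAP_RE.search(log_text)
    if match is None:
        return None
    return match.group(1), ipaddress.ip_address(match.group(2))


def on_docker_bridge(ip):
    """Return True if ip lies in Docker's default bridge subnet."""
    return ip in DOCKER_DEFAULT_BRIDGE


sample = "c39f5cf046aa:4533:4533 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.3<0>"
iface, ip = bootstrap_endpoint(sample)
print(iface, ip, on_docker_bridge(ip))  # eth0 172.17.0.3 True
```

A `True` result does not prove the cause; it only indicates that the address NCCL advertises to its peers is container-local, so cross-host reachability of that address is a reasonable next thing to verify.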
