InternLM / xtuner

An efficient, flexible and full-featured toolkit for fine-tuning LLM (InternLM2, Llama3, Phi3, Qwen, Mistral, ...)
https://xtuner.readthedocs.io/zh-cn/latest/
Apache License 2.0
4k stars 314 forks

Multi-node multi-GPU training fails with ss1.ss_family == ss2.ss_family. 2 vs 10 #924

Open sph116 opened 2 months ago

sph116 commented 2 months ago

Launch command for rank0:
NPROC_PER_NODE=1 NNODES=2 PORT=29600 ADDR=172.18.12.59 NODE_RANK=0 xtuner train train_config/internlm2_5_chat_7b_rank0_server_lora_train.py --deepspeed deepspeed_zero2

Launch command for rank1:
NPROC_PER_NODE=1 NNODES=2 PORT=29600 ADDR=172.18.12.59 NODE_RANK=1 xtuner train train_config/internlm2_5_chat_7b_rank1_server_lora_train.py --deepspeed deepspeed_zero2

rank1 and rank0 can communicate with each other, and training succeeds on each machine in single-GPU mode.

Error log

rank0: Traceback (most recent call last):
rank0:   File "/root/anaconda3/envs/xtuner-env/lib/python3.10/site-packages/xtuner/tools/train.py", line 360, in
rank0:   File "/root/anaconda3/envs/xtuner-env/lib/python3.10/site-packages/xtuner/tools/train.py", line 356, in main
rank0:   File "/root/anaconda3/envs/xtuner-env/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 1160, in train
rank0:     self._train_loop = self.build_train_loop(
rank0:   File "/root/anaconda3/envs/xtuner-env/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 958, in build_train_loop
rank0:     loop = LOOPS.build(
rank0:   File "/root/anaconda3/envs/xtuner-env/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
rank0:     return self.build_func(cfg, *args, **kwargs, registry=self)
rank0:   File "/root/anaconda3/envs/xtuner-env/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
rank0:     obj = obj_cls(**args)  # type: ignore
rank0:   File "/root/anaconda3/envs/xtuner-env/lib/python3.10/site-packages/xtuner/engine/runner/loops.py", line 32, in __init__
rank0:     dataloader = runner.build_dataloader(
rank0:   File "/root/anaconda3/envs/xtuner-env/lib/python3.10/site-packages/mmengine/runner/_flexible_runner.py", line 824, in build_dataloader
rank0:     dataset = DATASETS.build(dataset_cfg)
rank0:   File "/root/anaconda3/envs/xtuner-env/lib/python3.10/site-packages/mmengine/registry/registry.py", line 570, in build
rank0:     return self.build_func(cfg, *args, **kwargs, registry=self)
rank0:   File "/root/anaconda3/envs/xtuner-env/lib/python3.10/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
rank0:     obj = obj_cls(**args)  # type: ignore
rank0:   File "/root/anaconda3/envs/xtuner-env/lib/python3.10/site-packages/xtuner/dataset/huggingface.py", line 305, in process_hf_dataset
rank0:     group_gloo = dist.new_group(backend='gloo', timeout=xtuner_dataset_timeout)
rank0:   File "/root/anaconda3/envs/xtuner-env/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 93, in wrapper
rank0:     func_return = func(*args, **kwargs)
rank0:   File "/root/anaconda3/envs/xtuner-env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4125, in new_group
rank0:     return _new_group_with_tag(
rank0:   File "/root/anaconda3/envs/xtuner-env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4205, in _new_group_with_tag
rank0:     pg, pg_store = _new_process_group_helper(
rank0:   File "/root/anaconda3/envs/xtuner-env/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1569, in _new_process_group_helper
rank0:     backend_class = ProcessGroupGloo(backend_prefix_store, group_rank, group_size, timeout=timeout)
rank0: RuntimeError: [enforce fail at ../third_party/gloo/gloo/transport/tcp/device.cc:276] ss1.ss_family == ss2.ss_family. 2 vs 10
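
For anyone reading the log above: on Linux, the numbers 2 and 10 are the values of AF_INET (IPv4) and AF_INET6 (IPv6). The failed check therefore suggests that the gloo backend ended up comparing an IPv4 address with an IPv6 one. The sketch below is only a diagnostic aid, not a confirmed fix for this issue; the helper names `family_name` and `resolved_families` are made up for illustration. It decodes the two numbers and prints which address families the local hostname resolves to, and can be run on each node.

```python
import socket


def family_name(value: int) -> str:
    """Map a numeric address family to its symbolic name on this platform."""
    try:
        return socket.AddressFamily(value).name
    except ValueError:
        return f"unknown({value})"


def resolved_families(host: str) -> set[str]:
    """Return the names of the address families that `host` resolves to.

    An empty set means the name could not be resolved at all.
    """
    try:
        infos = socket.getaddrinfo(host, None)
    except socket.gaierror:
        return set()
    # The first field of each getaddrinfo() entry is the address family.
    return {family_name(info[0]) for info in infos}


if __name__ == "__main__":
    # Decode the two values from the error message (meaningful on Linux).
    print("2  ->", family_name(2))
    print("10 ->", family_name(10))

    # What does this node's own hostname resolve to?
    hostname = socket.gethostname()
    print(hostname, "->", resolved_families(hostname))

    # The master address from the launch command is a literal IPv4 address.
    print("172.18.12.59 ->", resolved_families("172.18.12.59"))
```

If one node reports only AF_INET6 (or both families) for its hostname while the other reports AF_INET, that mismatch would be consistent with the error. A commonly suggested next step in that situation is to pin gloo to the interface carrying the 172.18.12.x address through PyTorch's `GLOO_SOCKET_IFNAME` environment variable; whether that resolves this particular report is an assumption, since the thread does not confirm it.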