THUDM / ChatGLM3

ChatGLM3 series: Open Bilingual Chat LLMs | 开源双语对话语言模型

Single-node multi-GPU LoRA fine-tuning always fails with an NCCL error #1174

Closed · Hxinyue closed this issue 4 months ago

Hxinyue commented 4 months ago

System Info / 系統信息

Python 3.10.13
torch 2.0.1+cu118
NCCL 2.14.3
transformers 4.39.3
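
For reference, these versions were read from the training environment; a small sketch (not part of the ChatGLM3 repo) that prints them:

```python
# Illustrative version check only -- prints the interpreter, torch, NCCL and
# transformers versions from inside the fine-tuning environment.
import sys
import torch
import transformers

print("Python      ", sys.version.split()[0])    # 3.10.13
print("torch       ", torch.__version__)         # 2.0.1+cu118
print("NCCL        ", torch.cuda.nccl.version()) # (2, 14, 3)
print("transformers", transformers.__version__)  # 4.39.3
```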

Who can help? / 谁可以帮助到您?

@Btlmd

Information / 问题信息

Reproduction / 复现过程

1. Launch command:

```
OMP_NUM_THREADS=1 torchrun --standalone --nnodes=1 --nproc_per_node=4 finetune_hf.py \
    data/AdvertiseGen_fix/ \
    /usr/share/sc5tadm/HXY/chatglm3_6B/ChatGLM3/chatglm3-6b \
    configs/lora.yaml configs/ds_zero_2.json
```

2. Error message (traceback condensed to the frame locations and the failing calls):

```
DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1275, internal error, NCCL version 2.14.3
ncclInternalError: Internal check failed.
Last error: Failed to find reverse path from remNode 0/c1000 nlinks 4 to node 0/1d000

Traceback (most recent call last):
  finetune_demo/finetune_hf.py:570 in main                    -> trainer.predict(test_dataset)
  transformers/trainer_seq2seq.py:244 in predict
  transformers/trainer.py:3441 in predict
  transformers/trainer.py:3566 in evaluation_loop             -> labels = self.accelerator.pad_across_processes(labels, dim=1, pad_index=...)
  accelerate/accelerator.py:2350 in pad_across_processes
  accelerate/utils/operations.py:417 in wrapper
  accelerate/utils/operations.py:684 in pad_across_processes
  accelerate/utils/operations.py:126 in recursively_apply
  accelerate/utils/operations.py:665 in _pad_across_processes -> sizes = gather(size).cpu()
  accelerate/utils/operations.py:381 in wrapper
  accelerate/utils/operations.py:442 in gather
  accelerate/utils/operations.py:361 in _gpu_gather
  accelerate/utils/operations.py:126 in recursively_apply
  accelerate/utils/operations.py:351 in _gpu_gather_one       -> gather_op(output_tensors, tensor)
  torch/distributed/distributed_c10d.py:1451 in wrapper
  torch/distributed/distributed_c10d.py:2530 in all_gather_into_tensor -> work = default_pg._allgather_base(output_tensor, input_tensor)
DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1275, internal error, NCCL version 2.14.3
ncclInternalError: Internal check failed. Last error: Failed to find reverse path from remNode 0/c1000 nlinks 4 to node 0/1d000

WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1957665 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1957662) of binary: /opt/data/sc5tadm/HXY/chatglm3_6B/glm3-6B/bin/python3.10

Traceback (most recent call last):
  File "/opt/data/sc5tadm/HXY/chatglm3_6B/glm3-6B/bin/torchrun", line 8, in <module>
    sys.exit(main())
  torch/distributed/elastic/multiprocessing/errors/__init__.py:346 in wrapper
  torch/distributed/run.py:794 in main
  torch/distributed/run.py:785 in run
  torch/distributed/launcher/api.py:134 in __call__
  torch/distributed/launcher/api.py:250 in launch_agent       -> raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
```

```
finetune_hf.py FAILED

Failures:
  [1]:
    time       : 2024-04-24_10:37:08
    host       : sccn11.supcon5t.com
    rank       : 1 (local_rank: 1)
    exitcode   : 1 (pid: 1957663)
    error_file : <N/A>
    traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
  [2]:
    time       : 2024-04-24_10:37:08
    host       : sccn11.supcon5t.com
    rank       : 2 (local_rank: 2)
    exitcode   : 1 (pid: 1957664)
    error_file : <N/A>
    traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
  [0]:
    time       : 2024-04-24_10:37:08
    host       : sccn11.supcon5t.com
    rank       : 0 (local_rank: 0)
    exitcode   : 1 (pid: 1957662)
    error_file : <N/A>
    traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
```
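
Because the crash happens inside accelerate's gather() during trainer.predict, it can help to first check whether a bare NCCL all_gather works on this node at all, outside of finetune_hf.py. The following is only an illustrative sketch (the file name nccl_check.py and the launch line are my own, not part of the repo):

```python
# nccl_check.py -- hypothetical standalone test, not part of ChatGLM3.
# Launch it the same way as the fine-tuning run, e.g.:
#   OMP_NUM_THREADS=1 torchrun --standalone --nnodes=1 --nproc_per_node=4 nccl_check.py
import os
import torch
import torch.distributed as dist

os.environ.setdefault("NCCL_DEBUG", "INFO")  # ask NCCL to log its topology detection

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# One scalar per rank, gathered into a single tensor -- the same pattern as the
# accelerate gather(size) call that fails in the traceback above.
size = torch.tensor([dist.get_rank()], device=f"cuda:{local_rank}")
out = torch.empty(dist.get_world_size(), dtype=size.dtype, device=size.device)
dist.all_gather_into_tensor(out, size)
print(f"rank {dist.get_rank()}: gathered {out.tolist()}")

dist.destroy_process_group()
```

If this already fails with the same "Failed to find reverse path" error, the problem is in the GPU/NCCL setup rather than in the fine-tuning demo itself.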

Expected behavior / 期待表现

Successfully run the LoRA fine-tuning demo on a single machine with multiple GPUs.
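
Additional context: the "Failed to find reverse path" message comes from NCCL's topology/peer-path detection, so probing GPU peer access may help narrow this down. This is only a rough sketch assuming the 4 GPUs used above; if some pairs report False, rerunning with NCCL_P2P_DISABLE=1 set in the environment is a commonly tried workaround.

```python
# Hypothetical P2P probe, not part of the repo: checks whether each GPU pair
# reports peer access, which NCCL's path search relies on.
import torch

n = torch.cuda.device_count()
for i in range(n):
    for j in range(n):
        if i != j:
            ok = torch.cuda.can_device_access_peer(i, j)
            print(f"GPU {i} -> GPU {j}: peer access = {ok}")
```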