QwenLM / Qwen

The official repo of Qwen (通义千问) chat & pretrained large language model proposed by Alibaba Cloud.
Apache License 2.0
13.59k stars 1.11k forks

[BUG] Single-node multi-GPU LoRA fine-tuning of Qwen-14B with the finetune_lora_ds.sh script fails #936

Closed ghost closed 8 months ago

ghost commented 8 months ago

Is there an existing issue / discussion for this?

Is there an existing answer for this in the FAQ?

Current Behavior

Running the finetune_lora_ds.sh script on a single machine with eight A100 (40GB) GPUs to do distributed fine-tuning of the 14B model produces the following error:

    Traceback (most recent call last):
      File "finetune.py", line 360, in <module>
        train()
      File "finetune.py", line 353, in train
        trainer.train()
      File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 1555, in train
        return inner_training_loop(
      File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 1687, in _inner_training_loop
        model, self.optimizer, self.lr_scheduler = self.accelerator.prepare(
      File "/usr/local/lib/python3.8/dist-packages/accelerate/accelerator.py", line 1214, in prepare
        raise ValueError(
    ValueError: You can't train a model that has been loaded with device_map='auto' in any distributed mode. Please rerun your script specifying --num_processes=1 or by launching with python {{myscript.py}}.

Tracing the cause, the finetune.py script shipped with the project does not support distributed fine-tuning:

# This serves for single-gpu qlora.
if getattr(training_args, 'deepspeed', None) and int(os.environ.get("WORLD_SIZE", 1)) == 1:
    training_args.distributed_state.distributed_type = DistributedType.DEEPSPEED

local_rank = training_args.local_rank

device_map = "auto"
world_size = int(os.environ.get("WORLD_SIZE", 1))
ddp = world_size != 1
if lora_args.q_lora:
    device_map = {"": int(os.environ.get("LOCAL_RANK") or 0)} if ddp else "auto"
    if len(training_args.fsdp) > 0 or deepspeed.is_deepspeed_zero3_enabled():
        logging.warning(
            "FSDP or ZeRO3 are incompatible with QLoRA."
        )
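For context on why the error fires: `device_map="auto"` asks Accelerate to shard one model instance across all visible GPUs inside a single process, which conflicts with any multi-process launch (DDP or DeepSpeed), so `accelerator.prepare` refuses to continue. A minimal sketch of launch-mode-aware `device_map` selection is below; the helper name `resolve_device_map` is hypothetical and not from the repo, but the logic mirrors the QLoRA branch quoted above:

```python
import os
from typing import Dict, Optional, Union


def resolve_device_map(use_qlora: bool) -> Optional[Union[str, Dict[str, int]]]:
    """Pick a device_map that is safe for the current launch mode.

    Hypothetical helper (not part of the Qwen repo):
    - Multi-process launch (WORLD_SIZE > 1, i.e. DDP/DeepSpeed): each rank must
      own exactly one GPU. For QLoRA, pin the quantized model to LOCAL_RANK;
      otherwise return None and let the distributed framework place the model.
    - Single process: "auto" lets Accelerate shard the model across GPUs.
    """
    world_size = int(os.environ.get("WORLD_SIZE", 1))
    distributed = world_size != 1
    if use_qlora:
        # A quantized QLoRA model cannot be sharded; keep it whole on one device per rank.
        return {"": int(os.environ.get("LOCAL_RANK") or 0)} if distributed else "auto"
    # Plain LoRA under DDP/DeepSpeed: never pass "auto" in distributed mode.
    return None if distributed else "auto"
```

The value returned here would be passed as the `device_map` argument to `from_pretrained`; the key point is simply that `"auto"` must never reach a multi-process run.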

Expected Behavior

How should finetune.py be modified so that it supports distributed fine-tuning?

Steps To Reproduce

No response

Environment

- OS: CentOS Linux 7 (kernel 3.10.0-862.el7.x86_64)
- Python: 3.8.10
- Transformers: 4.32.0
- PyTorch: 2.0.1+cu117
- CUDA (`python -c 'import torch; print(torch.version.cuda)'`): 11.7

Anything else?

error_log.txt

fyabc commented 8 months ago

Hi, the latest version of finetune.py has fixed this issue. Please pull the latest code and try again.