OptimalScale / LMFlow

An Extensible Toolkit for Finetuning and Inference of Large Foundation Models. Large Models for All.
https://optimalscale.github.io/LMFlow/
Apache License 2.0

KeyError: 'gpt_bigcode' #601

Closed xiaohangguo closed 1 year ago

xiaohangguo commented 1 year ago

The script I ran:

#!/bin/bash
# Please run this script under ${project_id} in project directory of

deepspeed_args="--master_port=11000"      # Default argument
if [ $# -ge 1 ]; then
  deepspeed_args="$1"
fi

exp_id=finetune_with_lora
project_dir=$(cd "$(dirname $0)"/..; pwd)
output_dir=${project_dir}/output_models/${exp_id}
log_dir=${project_dir}/log/${exp_id}

dataset_path=${project_dir}/data/Ted_hubAndtest_datasets/train

mkdir -p ${output_dir} ${log_dir}

deepspeed ${deepspeed_args} \
  examples/finetune.py \
    --model_name_or_path bigcode/starcoderbase-7b \
    --dataset_path ${dataset_path} \
    --output_dir ${output_dir} --overwrite_output_dir \
    --num_train_epochs 3 \
    --learning_rate 1e-4 \
    --block_size 512 \
    --per_device_train_batch_size 1 \
    --use_lora 1 \
    --lora_r 8 \
    --save_aggregated_lora 0 \
    --deepspeed configs/ds_config_zero3.json \
    --fp16 \
    --run_name finetune_with_lora \
    --validation_split_percentage 0 \
    --logging_steps 20 \
    --do_train \
    --ddp_timeout 72000 \
    --save_steps 1000 \
    --dataloader_num_workers 1 \
    | tee ${log_dir}/train.log \
    2> ${log_dir}/train.err

Running it immediately fails with KeyError: 'gpt_bigcode'. After some checking, this is a transformers version problem: LMFlow uses transformers-4.28.0.dev0. The error:

  File "/home/hang/桌面/LMFlow/examples/finetune.py", line 61, in <module>
    main()
  File "/home/hang/桌面/LMFlow/examples/finetune.py", line 54, in main
    model = AutoModel.get_model(model_args)
  File "/home/hang/桌面/LMFlow/src/lmflow/models/auto_model.py", line 16, in get_model
    return HFDecoderModel(model_args, *args, **kwargs)
  File "/home/hang/桌面/LMFlow/src/lmflow/models/hf_decoder_model.py", line 156, in __init__
    config = AutoConfig.from_pretrained(model_args.model_name_or_path, **config_kwargs)
  File "/home/hang/anaconda3/envs/lmflow/lib/python3.9/site-packages/transformers/models/auto/configuration_auto.py", line 929, in from_pretrained
    config_class = CONFIG_MAPPING[config_dict["model_type"]]
  File "/home/hang/anaconda3/envs/lmflow/lib/python3.9/site-packages/transformers/models/auto/configuration_auto.py", line 635, in __getitem__
    raise KeyError(key)
KeyError: 'gpt_bigcode'
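
A quick sanity check (a minimal sketch, independent of LMFlow) is to verify whether the installed transformers registers the gpt_bigcode model type at all, since AutoConfig resolves config_dict["model_type"] through CONFIG_MAPPING:

# Run inside the same conda env used for training.
import transformers
from transformers.models.auto.configuration_auto import CONFIG_MAPPING

print("transformers version:", transformers.__version__)
# If "gpt_bigcode" is not registered here, AutoConfig.from_pretrained raises the KeyError above.
print("gpt_bigcode registered:", "gpt_bigcode" in CONFIG_MAPPING.keys())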

My first, rather naive idea was to simply uninstall transformers and reinstall the latest version, but that turned out not to work at all:

Traceback (most recent call last):
  File "/home/lvshuhang/gpt/LMFlow/examples/finetune.py", line 30, in <module>
    from lmflow.pipeline.auto_pipeline import AutoPipeline
  File "/home/lvshuhang/gpt/LMFlow/src/lmflow/pipeline/auto_pipeline.py", line 9, in <module>
    from lmflow.pipeline.raft_aligner import RaftAligner
  File "/home/lvshuhang/gpt/LMFlow/src/lmflow/pipeline/raft_aligner.py", line 32, in <module>
    from lmflow.pipeline.utils.raft_trainer import RaftTrainer
  File "/home/lvshuhang/gpt/LMFlow/src/lmflow/pipeline/utils/raft_trainer.py", line 23, in <module>
    from transformers.integrations import (
ImportError: cannot import name 'default_hp_search_backend' from 'transformers.integrations' (/home/lvshuhang/.local/lib/python3.9/site-packages/transformers/integrations.py)

It turns out the latest version has dropped this import path; in short, the code just isn't written that way anymore.

After checking the source code, the solution was to upgrade to 4.30.1, since that version still has default_hp_search_backend and also integrates gpt_bigcode, so it satisfies LMFlow's code dependencies: pip install transformers==4.30.1. Very unfortunately, a new error appeared:

Traceback (most recent call last):
  File "/home/lvshuhang/gpt/LMFlow/examples/finetune.py", line 64, in <module>
    main()
  File "/home/lvshuhang/gpt/LMFlow/examples/finetune.py", line 57, in main
    model = AutoModel.get_model(model_args)
  File "/home/lvshuhang/gpt/LMFlow/src/lmflow/models/auto_model.py", line 16, in get_model
    return HFDecoderModel(model_args, *args, **kwargs)
  File "/home/lvshuhang/gpt/LMFlow/src/lmflow/models/hf_decoder_model.py", line 259, in __init__
    model = get_peft_model(model, peft_config)
  File "/home/lvshuhang/.local/lib/python3.9/site-packages/peft/mapping.py", line 144, in get_peft_model
    peft_config = _prepare_lora_config(peft_config, model_config)
  File "/home/lvshuhang/.local/lib/python3.9/site-packages/peft/mapping.py", line 119, in _prepare_lora_config
    raise ValueError("Please specify `target_modules` in `peft_config`")
ValueError: Please specify `target_modules` in `peft_config`

Tracing into peft's mapping module, this is a LoRA-related error: the model type is simply not in the supported mapping.

TRANSFORMERS_MODELS_TO_LORA_TARGET_MODULES_MAPPING = {
    "t5": ["q", "v"],
    "mt5": ["q", "v"],
    "bart": ["q_proj", "v_proj"],
    "gpt2": ["c_attn"],
    "bloom": ["query_key_value"],
    "opt": ["q_proj", "v_proj"],
    "gptj": ["q_proj", "v_proj"],
    "gpt_neox": ["query_key_value"],
    "gpt_neo": ["q_proj", "v_proj"],
    "bert": ["query", "value"],
    "roberta": ["query", "value"],
    "xlm-roberta": ["query", "value"],
    "electra": ["query", "value"],
    "deberta-v2": ["query_proj", "value_proj"],
    "deberta": ["in_proj"],
    "layoutlm": ["query", "value"],
    "llama": ["q_proj", "v_proj"],
    "chatglm": ["query_key_value"],
}

The mapping does not say how to hook LoRA into this model, so my guess is that LoRA training of it simply isn't supported yet.
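
For what it's worth, a possible workaround (a hedged sketch, not verified end to end here) is to skip peft's built-in lookup and pass target_modules explicitly when building the LoRA config; the module name below assumes GPTBigCode exposes a GPT-2-style fused attention projection named c_attn, so adjust it if your transformers version names the modules differently:

# Sketch only: explicit target_modules so that peft's _prepare_lora_config
# never consults TRANSFORMERS_MODELS_TO_LORA_TARGET_MODULES_MAPPING.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

model = AutoModelForCausalLM.from_pretrained("bigcode/starcoderbase-1b")
peft_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["c_attn"],  # assumed module name for GPTBigCode attention
)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()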

So the next naive idea was that I should be able to try full finetuning instead. That then hit a DeepSpeed error:

Traceback (most recent call last):
  File "/home/lvshuhang/gpt/LMFlow/examples/finetune.py", line 64, in <module>
    main()
  File "/home/lvshuhang/gpt/LMFlow/examples/finetune.py", line 60, in main
    tuned_model = finetuner.tune(model=model, dataset=dataset)
  File "/home/lvshuhang/gpt/LMFlow/src/lmflow/pipeline/finetuner.py", line 277, in tune
    trainer = FinetuningTrainer(
  File "/home/lvshuhang/.local/lib/python3.9/site-packages/transformers/trainer.py", line 349, in __init__
    self.create_accelerator_and_postprocess()
  File "/home/lvshuhang/.local/lib/python3.9/site-packages/transformers/trainer.py", line 3968, in create_accelerator_and_postprocess
    self.accelerator = Accelerator(
  File "/home/lvshuhang/.local/lib/python3.9/site-packages/accelerate/accelerator.py", line 290, in __init__
    raise ImportError("DeepSpeed version must be >= 0.9.3. Please update DeepSpeed.")
ImportError: DeepSpeed version must be >= 0.9.3. Please update DeepSpeed.

OK, I upgraded: pip install deepspeed==0.9.3. Then an optimizer error came up:

Traceback (most recent call last):
  File "/home/lvshuhang/gpt/LMFlow/examples/finetune.py", line 64, in <module>
    main()
  File "/home/lvshuhang/gpt/LMFlow/examples/finetune.py", line 60, in main
    tuned_model = finetuner.tune(model=model, dataset=dataset)
  File "/home/lvshuhang/gpt/LMFlow/src/lmflow/pipeline/finetuner.py", line 298, in tune
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/lvshuhang/.local/lib/python3.9/site-packages/transformers/trainer.py", line 1645, in train
    return inner_training_loop(
  File "/home/lvshuhang/.local/lib/python3.9/site-packages/transformers/trainer.py", line 1725, in _inner_training_loop
    self.optimizer, self.lr_scheduler = deepspeed_init(self, num_training_steps=max_steps)
  File "/home/lvshuhang/.local/lib/python3.9/site-packages/transformers/deepspeed.py", line 361, in deepspeed_init
    optimizer, lr_scheduler = deepspeed_optim_sched(
  File "/home/lvshuhang/.local/lib/python3.9/site-packages/transformers/deepspeed.py", line 307, in deepspeed_optim_sched
    raise ValueError(
ValueError: Found `optimizer` configured in the DeepSpeed config, but no `scheduler`. Please configure a scheduler in the DeepSpeed config.
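
For anyone hitting the same wall: because the DeepSpeed config carries an optimizer section, transformers' DeepSpeed integration also expects a scheduler section in it. A hedged sketch of a fix, assuming the stock configs/ds_config_zero3.json is the file in use and that a WarmupLR scheduler with "auto" values (which the HF Trainer fills in from its own arguments, e.g. --learning_rate) is acceptable:

# Sketch only: patch the DeepSpeed config so it defines both optimizer and scheduler.
import json

path = "configs/ds_config_zero3.json"  # path assumed from the launch command above
with open(path) as f:
    ds_config = json.load(f)

ds_config["scheduler"] = {
    "type": "WarmupLR",
    "params": {
        "warmup_min_lr": "auto",
        "warmup_max_lr": "auto",
        "warmup_num_steps": "auto",
    },
}

with open(path, "w") as f:
    json.dump(ds_config, f, indent=4)
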
xiaohangguo commented 1 year ago

By the way, my server has Titan X cards and CUDA 12.2. bitsandbytes 0.38.1 errors out there, and I had to build the latest 0.41 from source for it to work; installing 0.38.1 on my local machine is fine, though. I don't know the reason for that yet. In any case, from my testing, neither my local machine nor the server can handle this starcoder-7b.

shizhediao commented 1 year ago

We are currently upgrading the transformers version. It is expected to be done within two days; you can try again then.

xiaohangguo commented 1 year ago

We are currently upgrading the transformers version. It is expected to be done within two days; you can try again then.

Sure, no problem. Thank you very much~

xiaohangguo commented 1 year ago

OK, I also tried the latest v0.0.5 code and it still reports the same error:

(lmflow) hang@hang-System-Product-Name:~/桌面/LMFlow/LMFlow-v0.5$ bash ./scripts/run_finetune.sh 
[2023-09-04 16:06:45,072] [WARNING] [runner.py:186:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-09-04 16:06:45,080] [INFO] [runner.py:550:main] cmd = /home/hang/anaconda3/envs/lmflow/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=11000 --enable_each_rank_log=None examples/finetune.py --model_name_or_path bigcode/starcoderbase-1b --dataset_path TED-data-meta1 --output_dir output_models/finetune_starcoder-1b --overwrite_output_dir --num_train_epochs 0.01 --learning_rate 2e-5 --block_size 1024 --per_device_train_batch_size 1 --deepspeed configs/ds_config_zero3.json --fp16 --run_name finetune --validation_split_percentage 0 --logging_steps 20 --do_train --ddp_timeout 72000 --save_steps 5000 --dataloader_num_workers 1
[2023-09-04 16:06:45,843] [INFO] [launch.py:142:main] WORLD INFO DICT: {'localhost': [0]}
[2023-09-04 16:06:45,843] [INFO] [launch.py:148:main] nnodes=1, num_local_procs=1, node_rank=0
[2023-09-04 16:06:45,843] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2023-09-04 16:06:45,843] [INFO] [launch.py:162:main] dist_world_size=1
[2023-09-04 16:06:45,843] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

 and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
bin /home/hang/anaconda3/envs/lmflow/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda116.so
/home/hang/anaconda3/envs/lmflow/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:145: UserWarning: Found duplicate ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] files: {PosixPath('/home/hang/anaconda3/envs/lmflow/lib/libcudart.so.11.0'), PosixPath('/home/hang/anaconda3/envs/lmflow/lib/libcudart.so')}.. We'll flip a coin and try one of these, in order to fail forward.
Either way, this might cause trouble in the future:
If you get `CUDA error: invalid device function` errors, the above might be the cause and the solution is to make sure only one ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] in the paths that we search based on your env.
  warn(msg)
CUDA SETUP: CUDA runtime path found: /home/hang/anaconda3/envs/lmflow/lib/libcudart.so.11.0
CUDA SETUP: Highest compute capability among GPUs detected: 8.6
CUDA SETUP: Detected CUDA version 116
CUDA SETUP: Loading binary /home/hang/anaconda3/envs/lmflow/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda116.so...
[2023-09-04 16:06:47,969] [INFO] [comm.py:652:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
09/04/2023 16:06:49 - WARNING - lmflow.pipeline.finetuner - Process rank: 0, device: cuda:0, n_gpu: 1,distributed training: True, 16-bits training: True
09/04/2023 16:06:50 - WARNING - datasets.builder - Found cached dataset json (/home/hang/.cache/huggingface/datasets/json/default-38464ccca6b505a7/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
Downloading (…)okenizer_config.json: 100%|██████████████████████████████████████████████████| 677/677 [00:00<00:00, 337kB/s]
Downloading (…)olve/main/vocab.json: 100%|████████████████████████████████████████████████| 777k/777k [00:00<00:00, 837kB/s]
Downloading (…)olve/main/merges.txt: 100%|███████████████████████████████████████████████| 442k/442k [00:00<00:00, 1.85MB/s]
Downloading (…)/main/tokenizer.json: 100%|█████████████████████████████████████████████| 2.06M/2.06M [00:01<00:00, 1.99MB/s]
Downloading (…)cial_tokens_map.json: 100%|██████████████████████████████████████████████████| 532/532 [00:00<00:00, 366kB/s]
Downloading (…)lve/main/config.json: 100%|██████████████████████████████████████████████| 1.05k/1.05k [00:00<00:00, 401kB/s]
Traceback (most recent call last):
  File "/home/hang/桌面/LMFlow/LMFlow-v0.5/examples/finetune.py", line 61, in <module>
    main()
  File "/home/hang/桌面/LMFlow/LMFlow-v0.5/examples/finetune.py", line 54, in main
    model = AutoModel.get_model(model_args)
  File "/home/hang/桌面/LMFlow/LMFlow-v0.5/src/lmflow/models/auto_model.py", line 16, in get_model
    return HFDecoderModel(model_args, *args, **kwargs)
  File "/home/hang/桌面/LMFlow/LMFlow-v0.5/src/lmflow/models/hf_decoder_model.py", line 188, in __init__
    config = AutoConfig.from_pretrained(model_args.model_name_or_path, **config_kwargs)
  File "/home/hang/anaconda3/envs/lmflow/lib/python3.9/site-packages/transformers/models/auto/configuration_auto.py", line 929, in from_pretrained
    config_class = CONFIG_MAPPING[config_dict["model_type"]]
  File "/home/hang/anaconda3/envs/lmflow/lib/python3.9/site-packages/transformers/models/auto/configuration_auto.py", line 635, in __getitem__
    raise KeyError(key)
KeyError: 'gpt_bigcode'
[2023-09-04 16:07:01,863] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 18974
[2023-09-04 16:07:01,864] [ERROR] [launch.py:324:sigkill_handler] ['/home/hang/anaconda3/envs/lmflow/bin/python', '-u', 'examples/finetune.py', '--local_rank=0', '--model_name_or_path', 'bigcode/starcoderbase-1b', '--dataset_path', 'TED-data-meta1', '--output_dir', 'output_models/finetune_starcoder-1b', '--overwrite_output_dir', '--num_train_epochs', '0.01', '--learning_rate', '2e-5', '--block_size', '1024', '--per_device_train_batch_size', '1', '--deepspeed', 'configs/ds_config_zero3.json', '--fp16', '--run_name', 'finetune', '--validation_split_percentage', '0', '--logging_steps', '20', '--do_train', '--ddp_timeout', '72000', '--save_steps', '5000', '--dataloader_num_workers', '1'] exits with return code = 1

The script configuration:

#!/bin/bash
# Please run this script under ${project_id} in project directory of
#   https://github.com/shizhediao/llm-ft
#     COMMIT: d5fecf30ba8011067b10cf51fede53a5ab6574e4

# Parses arguments
model_name_or_path=bigcode/starcoderbase-1b
dataset_path=TED-data-meta1
output_dir=output_models/finetune_starcoder-1b
deepspeed_args="--master_port=11000"

while [[ $# -ge 1 ]]; do
  key="$1"
  case ${key} in
    -m|--model_name_or_path)
      model_name_or_path="$2"
      shift
      ;;
    -d|--dataset_path)
      dataset_path="$2"
      shift
      ;;
    -o|--output_model_path)
      output_dir="$2"
      shift
      ;;
    --deepspeed_args)
      deepspeed_args="$2"
      shift
      ;;
    *)
      echo "error: unknown option \"${key}\"" 1>&2
      exit 1
  esac
  shift
done

# Finetune
exp_id=finetune
project_dir=$(cd "$(dirname $0)"/..; pwd)
log_dir=${project_dir}/log/${exp_id}
mkdir -p ${output_dir} ${log_dir}

deepspeed ${deepspeed_args} \
  examples/finetune.py \
    --model_name_or_path ${model_name_or_path} \
    --dataset_path ${dataset_path} \
    --output_dir ${output_dir} --overwrite_output_dir \
    --num_train_epochs 0.01 \
    --learning_rate 2e-5 \
    --block_size 1024 \
    --per_device_train_batch_size 1 \
    --deepspeed configs/ds_config_zero3.json \
    --fp16 \
    --run_name finetune \
    --validation_split_percentage 0 \
    --logging_steps 20 \
    --do_train \
    --ddp_timeout 72000 \
    --save_steps 5000 \
    --dataloader_num_workers 1 \
    | tee ${log_dir}/train.log \
    2> ${log_dir}/train.err
research4pan commented 1 year ago

Thanks for your attention. We haven't upgraded the transformers version in v0.0.5. The transformers upgrade results in several minor bugs which we are currently fixing, but the basic functions should be usable. If you would like to try that version, you may git clone our main branch and check whether the problem still occurs. Thanks!

Thanks for your attention. Our v0.0.5 release has not upgraded transformers yet. The main branch has already been upgraded; there are still a few minor issues, but the basic finetune functionality should work. If your GPU is fairly new, you can try the main branch first. Thanks for your understanding 🙏

xiaohangguo commented 1 year ago

OK, pip install -U transformers took me from transformers 4.28.0.dev0 to transformers-4.23.1, and it then complained that deepspeed>=0.9.3 is required... I'll just go with the main branch code for now.

xiaohangguo commented 1 year ago

Maybe the documentation in the README could be updated accordingly?

research4pan commented 1 year ago

No problem. We will update the README later accordingly. Thanks for your suggestions!

research4pan commented 1 year ago

Please feel free to reopen this issue or create a new one if any further problem occurs. Thanks!

If you run into any other problems, you can reopen this issue or open a new one. Thanks for your support 🙏

xiaohangguo commented 1 year ago

OK, thanks for the reply. I tested it and there are no problems.