Closed · xiaohangguo closed this issue 1 year ago
By the way, my server has a Titan X card with CUDA 12.2. With bitsandbytes 0.38.1 it errors out there, and I have to build the latest 0.41 from source for it to work. On my local machine, however, installing 0.38.1 works fine. I don't know the reason yet. But from what I've verified, neither the local machine nor the server supports starcoder-7b.
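For reference, a from-source install of that kind looks roughly like this (a sketch based on the bitsandbytes README of that time; the CUDA_VERSION value and the make target are assumptions for a CUDA 12.x toolkit and may need adjusting):
# Assumed build steps for bitsandbytes on a CUDA 12.x machine.
git clone https://github.com/TimDettmers/bitsandbytes.git
cd bitsandbytes
CUDA_VERSION=122 make cuda12x
python setup.py install
python -m bitsandbytes   # sanity check suggested by the library itself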
We are currently upgrading the transformers version. It should be done within about two days; you can try again then.
Sure, no problem. Thanks a lot~
OK, I tried the latest v0.0.5 code again and it still reports the same error:
(lmflow) hang@hang-System-Product-Name:~/桌面/LMFlow/LMFlow-v0.5$ bash ./scripts/run_finetune.sh
[2023-09-04 16:06:45,072] [WARNING] [runner.py:186:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-09-04 16:06:45,080] [INFO] [runner.py:550:main] cmd = /home/hang/anaconda3/envs/lmflow/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=11000 --enable_each_rank_log=None examples/finetune.py --model_name_or_path bigcode/starcoderbase-1b --dataset_path TED-data-meta1 --output_dir output_models/finetune_starcoder-1b --overwrite_output_dir --num_train_epochs 0.01 --learning_rate 2e-5 --block_size 1024 --per_device_train_batch_size 1 --deepspeed configs/ds_config_zero3.json --fp16 --run_name finetune --validation_split_percentage 0 --logging_steps 20 --do_train --ddp_timeout 72000 --save_steps 5000 --dataloader_num_workers 1
[2023-09-04 16:06:45,843] [INFO] [launch.py:142:main] WORLD INFO DICT: {'localhost': [0]}
[2023-09-04 16:06:45,843] [INFO] [launch.py:148:main] nnodes=1, num_local_procs=1, node_rank=0
[2023-09-04 16:06:45,843] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2023-09-04 16:06:45,843] [INFO] [launch.py:162:main] dist_world_size=1
[2023-09-04 16:06:45,843] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run
python -m bitsandbytes
and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
bin /home/hang/anaconda3/envs/lmflow/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda116.so
/home/hang/anaconda3/envs/lmflow/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:145: UserWarning: Found duplicate ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] files: {PosixPath('/home/hang/anaconda3/envs/lmflow/lib/libcudart.so.11.0'), PosixPath('/home/hang/anaconda3/envs/lmflow/lib/libcudart.so')}.. We'll flip a coin and try one of these, in order to fail forward.
Either way, this might cause trouble in the future:
If you get `CUDA error: invalid device function` errors, the above might be the cause and the solution is to make sure only one ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] in the paths that we search based on your env.
warn(msg)
CUDA SETUP: CUDA runtime path found: /home/hang/anaconda3/envs/lmflow/lib/libcudart.so.11.0
CUDA SETUP: Highest compute capability among GPUs detected: 8.6
CUDA SETUP: Detected CUDA version 116
CUDA SETUP: Loading binary /home/hang/anaconda3/envs/lmflow/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda116.so...
[2023-09-04 16:06:47,969] [INFO] [comm.py:652:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
09/04/2023 16:06:49 - WARNING - lmflow.pipeline.finetuner - Process rank: 0, device: cuda:0, n_gpu: 1,distributed training: True, 16-bits training: True
09/04/2023 16:06:50 - WARNING - datasets.builder - Found cached dataset json (/home/hang/.cache/huggingface/datasets/json/default-38464ccca6b505a7/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
Downloading (…)okenizer_config.json: 100%|██████████████████████████████████████████████████| 677/677 [00:00<00:00, 337kB/s]
Downloading (…)olve/main/vocab.json: 100%|████████████████████████████████████████████████| 777k/777k [00:00<00:00, 837kB/s]
Downloading (…)olve/main/merges.txt: 100%|███████████████████████████████████████████████| 442k/442k [00:00<00:00, 1.85MB/s]
Downloading (…)/main/tokenizer.json: 100%|█████████████████████████████████████████████| 2.06M/2.06M [00:01<00:00, 1.99MB/s]
Downloading (…)cial_tokens_map.json: 100%|██████████████████████████████████████████████████| 532/532 [00:00<00:00, 366kB/s]
Downloading (…)lve/main/config.json: 100%|██████████████████████████████████████████████| 1.05k/1.05k [00:00<00:00, 401kB/s]
Traceback (most recent call last):
File "/home/hang/桌面/LMFlow/LMFlow-v0.5/examples/finetune.py", line 61, in <module>
main()
File "/home/hang/桌面/LMFlow/LMFlow-v0.5/examples/finetune.py", line 54, in main
model = AutoModel.get_model(model_args)
File "/home/hang/桌面/LMFlow/LMFlow-v0.5/src/lmflow/models/auto_model.py", line 16, in get_model
return HFDecoderModel(model_args, *args, **kwargs)
File "/home/hang/桌面/LMFlow/LMFlow-v0.5/src/lmflow/models/hf_decoder_model.py", line 188, in __init__
config = AutoConfig.from_pretrained(model_args.model_name_or_path, **config_kwargs)
File "/home/hang/anaconda3/envs/lmflow/lib/python3.9/site-packages/transformers/models/auto/configuration_auto.py", line 929, in from_pretrained
config_class = CONFIG_MAPPING[config_dict["model_type"]]
File "/home/hang/anaconda3/envs/lmflow/lib/python3.9/site-packages/transformers/models/auto/configuration_auto.py", line 635, in __getitem__
raise KeyError(key)
KeyError: 'gpt_bigcode'
[2023-09-04 16:07:01,863] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 18974
[2023-09-04 16:07:01,864] [ERROR] [launch.py:324:sigkill_handler] ['/home/hang/anaconda3/envs/lmflow/bin/python', '-u', 'examples/finetune.py', '--local_rank=0', '--model_name_or_path', 'bigcode/starcoderbase-1b', '--dataset_path', 'TED-data-meta1', '--output_dir', 'output_models/finetune_starcoder-1b', '--overwrite_output_dir', '--num_train_epochs', '0.01', '--learning_rate', '2e-5', '--block_size', '1024', '--per_device_train_batch_size', '1', '--deepspeed', 'configs/ds_config_zero3.json', '--fp16', '--run_name', 'finetune', '--validation_split_percentage', '0', '--logging_steps', '20', '--do_train', '--ddp_timeout', '72000', '--save_steps', '5000', '--dataloader_num_workers', '1'] exits with return code = 1
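As a quick sanity check (a sketch, not from the original run), you can ask the installed transformers whether the gpt_bigcode architecture is registered at all, which is exactly the lookup that fails in the traceback:
# Reproduces the failing CONFIG_MAPPING lookup from the traceback directly.
python -c "
import transformers
from transformers.models.auto.configuration_auto import CONFIG_MAPPING
print('transformers', transformers.__version__)
try:
    CONFIG_MAPPING['gpt_bigcode']
    print('gpt_bigcode is registered')
except KeyError:
    print('gpt_bigcode is NOT registered in this transformers version')
"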
Configuration:
#!/bin/bash
# Please run this script under ${project_id} in project directory of
# https://github.com/shizhediao/llm-ft
# COMMIT: d5fecf30ba8011067b10cf51fede53a5ab6574e4
# Parses arguments
model_name_or_path=bigcode/starcoderbase-1b
dataset_path=TED-data-meta1
output_dir=output_models/finetune_starcoder-1b
deepspeed_args="--master_port=11000"
while [[ $# -ge 1 ]]; do
  key="$1"
  case ${key} in
    -m|--model_name_or_path)
      model_name_or_path="$2"
      shift
      ;;
    -d|--dataset_path)
      dataset_path="$2"
      shift
      ;;
    -o|--output_model_path)
      output_dir="$2"
      shift
      ;;
    --deepspeed_args)
      deepspeed_args="$2"
      shift
      ;;
    *)
      echo "error: unknown option \"${key}\"" 1>&2
      exit 1
  esac
  shift
done
# Finetune
exp_id=finetune
project_dir=$(cd "$(dirname $0)"/..; pwd)
log_dir=${project_dir}/log/${exp_id}
mkdir -p ${output_dir} ${log_dir}
deepspeed ${deepspeed_args} \
  examples/finetune.py \
    --model_name_or_path ${model_name_or_path} \
    --dataset_path ${dataset_path} \
    --output_dir ${output_dir} --overwrite_output_dir \
    --num_train_epochs 0.01 \
    --learning_rate 2e-5 \
    --block_size 1024 \
    --per_device_train_batch_size 1 \
    --deepspeed configs/ds_config_zero3.json \
    --fp16 \
    --run_name finetune \
    --validation_split_percentage 0 \
    --logging_steps 20 \
    --do_train \
    --ddp_timeout 72000 \
    --save_steps 5000 \
    --dataloader_num_workers 1 \
    | tee ${log_dir}/train.log \
    2> ${log_dir}/train.err
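For completeness, the same settings can also be overridden from the command line; the options map to the case branches in the argument parser above:
# Example invocation with explicit flag overrides.
bash ./scripts/run_finetune.sh \
  --model_name_or_path bigcode/starcoderbase-1b \
  --dataset_path TED-data-meta1 \
  --output_model_path output_models/finetune_starcoder-1b \
  --deepspeed_args "--master_port=11000"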
Thanks for your attention. We haven't upgraded the transformers version in v0.0.5; the main branch has already been upgraded. The upgrade introduced several minor bugs that we are currently fixing, but the basic fine-tuning functions should be usable. If your GPU is relatively new, you can git clone our main branch and check whether the problem still occurs. Thanks for your understanding 🙏
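For example (a sketch; the repository URL and the editable install step are assumptions based on the public LMFlow README, not instructions from this thread):
# Assumed steps for trying the main branch instead of the v0.0.5 release.
git clone https://github.com/OptimalScale/LMFlow.git
cd LMFlow
conda activate lmflow   # reuse the existing environment
pip install -e .        # reinstall lmflow from the checked-out main branch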
OK. I ran pip install -U transformers, which changed transformers from 4.28.0.dev0 to transformers-4.23.1, and then it prompted that deepspeed>=0.9.3 is required ……
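To see which versions actually ended up installed after that, something like the following works (a sketch):
# Print the resolved versions of the two packages in question.
pip show transformers deepspeed | grep -E "Name|Version"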
Then I'll just use the main branch code for now.
Maybe the documentation in the README could be updated to reflect this?
No problem. We will update the README later accordingly. Thanks for your suggestions!
Please feel free to reopen this issue or create a new one if any further problem occurs during use. Thanks for your support 🙏
OK, thanks for the reply. Tested it, no problems.
The script I ran:
Right after running it, it throws a KeyError: 'gpt_bigcode'. After checking, this is a transformers version issue: lmflow uses transformers-4.28.0.dev0. The error:
My first, rather naive idea was to simply uninstall transformers and reinstall the latest version, but that turned out not to work at all.
It turns out the newest version has deprecated that import; in short, it is no longer written that way.
Then, after checking the source code, the fix was to upgrade to 4.30.1, because that version still has default_hp_search_backend and also integrates gpt_bigcode, so it satisfies lmflow's code dependencies: pip install transformers==4.30.1
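To confirm the pin actually provides both things lmflow needs, a check along these lines can be used (a sketch; the transformers.integrations import path for default_hp_search_backend is an assumption for 4.30.x):
# Verify that the pinned transformers exposes default_hp_search_backend and gpt_bigcode.
python -c "
import transformers
from transformers.integrations import default_hp_search_backend  # assumed location in 4.30.x
from transformers import GPTBigCodeConfig                        # gpt_bigcode architecture
print('transformers', transformers.__version__, 'OK')
"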
Very unfortunately, a new error came up: tracing it down to the mapping, it turns out to be a LoRA-type error related to the model not being supported.
The mapping does not say how this model should be merged, so I suspect LoRA training is simply not supported for it. Then another naive idea came up: I should be able to try full-parameter fine-tuning instead. That then threw a deepspeed error:
OK, so I upgraded: pip install deepspeed==0.9.3
Then it threw an optimizer error.
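When debugging optimizer/op-builder errors like this one, an environment dump is usually the first thing to collect (ds_report ships with deepspeed; the one-liner below is a sketch):
# Summarize deepspeed op compatibility plus the installed torch/transformers/deepspeed versions.
ds_report
python -c "import torch, transformers, deepspeed; print(torch.__version__, transformers.__version__, deepspeed.__version__)"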