Facico / Chinese-Vicuna

Chinese-Vicuna: A Chinese Instruction-following LLaMA-based Model —— a low-resource Chinese llama+lora recipe, structured after alpaca
https://github.com/Facico/Chinese-Vicuna
Apache License 2.0

Running finetune on a 2080 Ti throws an error #91

Closed grantchenhuarong closed 1 year ago

grantchenhuarong commented 1 year ago

If you run into a problem and need our help, please describe it from the following angles so that we can understand or reproduce your error (learning how to ask a good question not only helps us understand you, it is also a self-check):

1. Which script did you use, and with what command?

bash finetune.sh

2. What were your parameters (script arguments, command arguments)?

TOT_CUDA="0"
CUDAs=(${TOT_CUDA//,/ })
CUDA_NUM=${#CUDAs[@]}
PORT="12345"

DATA_PATH="./sample/merge_sample.json" #"../dataset/instruction/guanaco_non_chat_mini_52K-utf8.json" #"./sample/merge_sample.json"
OUTPUT_PATH="lora-Vicuna"
MODEL_PATH="/data/ftp/models/llama/7b"
lora_checkpoint="./lora-Vicuna/checkpoint-11600"
TEST_SIZE=1

CUDA_VISIBLE_DEVICES=${TOT_CUDA} torchrun --nproc_per_node=$CUDA_NUM --master_port=$PORT finetune.py \
    --data_path $DATA_PATH \
    --output_path $OUTPUT_PATH \
    --model_path $MODEL_PATH \
    --eval_steps 200 \
    --save_steps 200 \
    --test_size $TEST_SIZE

3. Did you modify our code? No.

4. Which dataset did you use? ./sample/merge_sample.json

If everything above is unchanged, you can simply say "I used such-and-such script and command, ran such-and-such task, and all other parameters and data are the same as yours", which makes it easier for us to follow your problem in parallel.

Then you can describe your problem from the environment angle; the "related problems and solutions" notes in the README may already cover some of these:

1. Which operating system? centos7
2. Which GPU, and how many? One 2080 Ti (11 GB)
3. Python version? 3.8.16
4. Versions of the Python libraries? transformers 4.28.1; everything else installed from requirements.txt

You can also describe your problem from the runtime angle:

1. What is the error message, and which code raised it? (You can send us the complete error output.)

trainer.train(resume_from_checkpoint=args.resume_from_checkpoint)

RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons: 1) Use of a module parameter outside the forward function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes. or try to use _set_static_graph() as a workaround if this module graph does not change during training loop. 2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple checkpoint functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases in default. You can try to use _set_static_graph() as a workaround if your module graph does not change over iterations. Parameter at index 127 has been marked as ready twice. This means that multiple autograd engine hooks have fired for this particular parameter during this iteration. You can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print parameter names for further debugging.

Running TORCH_DISTRIBUTED_DEBUG=DETAIL bash finetune.sh gives more detail: it is the parameter update on the last layer that trips the check (a hedged sketch of the workaround the message suggests is at the end of this comment):

Parameter at index 127 with name base_model.model.model.layers.31.self_attn.v_proj.lora_B.weight has been marked as ready twice. This means that multiple autograd engine hooks have fired for this particular parameter during this iteration.

2. Are the GPU and CPU working normally? Yes, both are normal.

At the same time, you can look through the existing issues, or check whether the "related problems and solutions" notes we maintain already cover something similar.

Of course, this is only a guide for asking questions; you do not have to follow every item in it.
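For reference, below is a minimal, hedged sketch (not code from this repo) of the workaround the error message itself suggests: marking the DDP graph as static so that parameters whose autograd hooks fire more than once, e.g. under reentrant gradient checkpointing, are not reported as "marked ready twice". The toy Linear layer and the LOCAL_RANK handling are placeholders for the LoRA model and launcher setup in finetune.py.

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Launch with: torchrun --nproc_per_node=1 this_script.py
dist.init_process_group(backend="nccl")            # torchrun supplies the env:// rendezvous
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(16, 16).cuda(local_rank)   # placeholder for the PEFT/LoRA model
ddp_model = DDP(model, device_ids=[local_rank])
# Declare the autograd graph static so parameters touched by reentrant backward
# passes (gradient checkpointing) are not flagged as "marked ready twice".
# Newer PyTorch also accepts DDP(..., static_graph=True) at construction time.
ddp_model._set_static_graph()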

grantchenhuarong commented 1 year ago

OK, switching to the single-process (non-torchrun) run makes this problem go away:

python finetune.py --data_path "./sample/merge_sample.json" \
    --output_path "lora-Vicuna" \
    --model_path "/data/ftp/models/llama/7b" \
    --eval_steps 200 \
    --save_steps 200 \
    --test_size 1

The only catch is running it in 11 GB: even after shrinking the batch parameters, the GPU keeps running out of memory... OOM. Has anyone actually managed to run finetune on a 2080 Ti?
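For anyone else trying to squeeze this into 11 GB: the "batch parameters" usually come down to the per-device micro batch size and gradient accumulation in transformers' TrainingArguments. A hedged sketch follows; the values are illustrative, not Chinese-Vicuna's actual configuration, and whether they fit on a 2080 Ti is exactly the open question here.

from transformers import TrainingArguments

# Illustrative memory-saving knobs only; not the repo's defaults.
training_args = TrainingArguments(
    output_dir="lora-Vicuna",
    per_device_train_batch_size=1,    # smallest possible micro batch
    gradient_accumulation_steps=128,  # keep the effective batch size large
    fp16=True,                        # half-precision activations/gradients
    gradient_checkpointing=True,      # trade recompute for activation memory
    eval_steps=200,
    save_steps=200,
)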

grantchenhuarong commented 1 year ago

Confirmed: training itself runs normally, but an exception is raised when saving the model...

model.save_pretrained(OUTPUT_DIR)

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 44.00 MiB (GPU 0; 10.75 GiB total capacity; 9.56 GiB already allocated; 41.50 MiB free; 9.92 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
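The last sentence of that OOM message points at one thing worth trying before anything else. A hedged sketch of setting it from Python is below; the 128 MiB split size is just an example, and since it only mitigates fragmentation it may or may not rescue the save step.

import os

# Must be set before the first CUDA allocation; exporting it in finetune.sh
# (export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128) works the same way.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # imported afterwards so the allocator sees the setting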

Facico commented 1 year ago

Running torchrun on a single GPU does cause problems; we recommend torchrun only for multi-GPU runs. There are many similar issues in the tracker, such as this one. As for the problem below: is it that training runs normally but saving OOMs? Try downgrading your transformers version, or try this: pip install git+https://github.com/huggingface/transformers@ff20f9cf3615a8638023bc82925573cb9d0f3560

grantchenhuarong commented 1 year ago

I downgraded transformers from 4.28.1 to 4.28.0.dev, with a similar result.

Training only uses 8.2 GB in total; it is when saving the model, presumably while cloning the weights, that memory blows up. The output is shown below.

File "finetune.py", line 278, in model.save_pretrained(OUTPUT_DIR) File "/home/ai/.conda/envs/chinesevicuna/lib/python3.8/site-packages/peft/peft_model.py", line 103, in save_pretrained output_state_dict = get_peft_model_state_dict(self, kwargs.get("state_dict", None)) File "/home/ai/.conda/envs/chinesevicuna/lib/python3.8/site-packages/peft/utils/save_and_load.py", line 31, in get_peft_model_state_dict state_dict = model.statedict() File "finetune.py", line 268, in lambda self, *, **__: get_peft_model_state_dict(self, old_state_dict()) File "/home/ai/.conda/envs/chinesevicuna/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1818, in state_dict module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars) File "/home/ai/.conda/envs/chinesevicuna/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1818, in state_dict module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars) File "/home/ai/.conda/envs/chinesevicuna/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1818, in state_dict module.state_dict(destination=destination, prefix=prefix + name + '.', keep_vars=keep_vars) [Previous line repeated 4 more times] File "/home/ai/.conda/envs/chinesevicuna/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1815, in state_dict self._save_to_state_dict(destination, prefix, keep_vars) File "/home/ai/.conda/envs/chinesevicuna/lib/python3.8/site-packages/bitsandbytes/nn/modules.py", line 262, in _save_to_state_dict weight_clone = self.weight.data.clone()

grantchenhuarong commented 1 year ago

Could this be caused by an automatic precision conversion being performed when the model is saved?

Facico commented 1 year ago

Take a look at this issue; it may be a problem with the bitsandbytes version.

grantchenhuarong commented 1 year ago

Thanks, that was indeed it; the problem is solved.

(chinesevicuna) ai@ai-2080ti:~/src/Chinese-Vicuna$ pip list | grep bitsandbytes
bitsandbytes 0.38.1
(chinesevicuna) ai@ai-2080ti:~/src/Chinese-Vicuna$ pip install bitsandbytes==0.37.2
Collecting bitsandbytes==0.37.2
  Using cached bitsandbytes-0.37.2-py3-none-any.whl (84.2 MB)
Installing collected packages: bitsandbytes
  Attempting uninstall: bitsandbytes
    Found existing installation: bitsandbytes 0.38.1
    Uninstalling bitsandbytes-0.38.1:
      Successfully uninstalled bitsandbytes-0.38.1
Successfully installed bitsandbytes-0.37.2