QwenLM / Qwen

The official repo of Qwen (通义千问) chat & pretrained large language model proposed by Alibaba Cloud.

[BUG] OOM when LoRA fine-tuning a 1.4B model on a single 3090 12G #1002

Closed. lizhili closed this issue 8 months ago.

lizhili commented 8 months ago

Is there an existing issue / discussion for this?

Is there an existing answer for this in the FAQ?

Current Behavior

Running finetune_lora_single_gpu.sh fails with the error below. The script was pulled on Jan 23; its contents are:

```bash
export CUDA_DEVICE_MAX_CONNECTIONS=1

MODEL="/home/Qwen/Qwen1.4/model" # Set the path if you do not want to load from huggingface directly

# ATTENTION: specify the path to your training data, which should be a json file consisting of a list of conversations.
# See the section for finetuning in README for more information.
DATA="/home/LLM_poc/output.json"

function usage() {
    echo '
Usage: bash finetune/finetune_lora_single_gpu.sh [-m MODEL_PATH] [-d DATA_PATH]
'
}

while [[ "$1" != "" ]]; do
    case $1 in
        -m | --model )
            shift
            MODEL=$1
            ;;
        -d | --data )
            shift
            DATA=$1
            ;;
        -h | --help )
            usage
            exit 0
            ;;
        # (the remainder of the argument-parsing loop was cut off in the original paste)

export CUDA_VISIBLE_DEVICES=0

python finetune.py \
  --model_name_or_path $MODEL \
  --data_path $DATA \
  --bf16 True \
  --output_dir output_qwen \
  --num_train_epochs 5 \
  --per_device_train_batch_size 1 \
  --per_device_eval_batch_size 1 \
  --gradient_accumulation_steps 8 \
  --evaluation_strategy "no" \
  --save_strategy "steps" \
  --save_steps 1000 \
  --save_total_limit 10 \
  --learning_rate 3e-4 \
  --weight_decay 0.1 \
  --adam_beta2 0.95 \
  --warmup_ratio 0.01 \
  --lr_scheduler_type "cosine" \
  --logging_steps 1 \
  --report_to "none" \
  --model_max_length 128 \
  --lazy_preprocess True \
  --gradient_checkpointing \
  --use_lora
```

GPU status:

```
Wed Jan 24 10:20:49 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.116.04    Driver Version: 525.116.04    CUDA Version: 12.0   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:2D:00.0 Off |                  N/A |
| 30%   42C    P0   120W / 350W |    296MiB / 12288MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:99:00.0 Off |                  N/A |
| 30%   41C    P0   122W / 350W |     10MiB / 12288MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      5165      G   /usr/lib/xorg/Xorg                 18MiB |
|    0   N/A  N/A      5200      G   /usr/bin/gnome-shell               70MiB |
|    0   N/A  N/A      7879      G   ...on=20240118-080138.585000       31MiB |
|    0   N/A  N/A     17079      G   /usr/lib/xorg/Xorg                110MiB |
|    0   N/A  N/A     17207      G   /usr/bin/gnome-shell               62MiB |
|    1   N/A  N/A      5165      G   /usr/lib/xorg/Xorg                  4MiB |
|    1   N/A  N/A     17079      G   /usr/lib/xorg/Xorg                  4MiB |
+-----------------------------------------------------------------------------+
```

Error:

```
[2024-01-24 10:22:49,095] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
The model is automatically converting to bf16 for faster inference. If you want to disable the automatic precision, please manually add bf16/fp16/fp32=True to "AutoModelForCausalLM.from_pretrained".
Try importing flash-attention for faster inference...
Warning: import flash_attn rotary fail, please install FlashAttention rotary to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/rotary
Warning: import flash_attn rms_norm fail, please install FlashAttention layer_norm to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/layer_norm
Warning: import flash_attn fail, please install FlashAttention to get higher efficiency https://github.com/Dao-AILab/flash-attention
Loading checkpoint shards: 100%|██████████| 2/2 [00:00<00:00,  8.22it/s]
trainable params: 676,003,840 || all params: 2,512,832,512 || trainable%: 26.902065170350678
Loading data...
Formatting inputs...Skip in lazy mode
Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
  0%|          | 0/30510 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/cajr/lizl/Qwen/Qwen1.4/Qwen-main/finetune/finetune.py", line 374, in <module>
    train()
  File "/home/cajr/lizl/Qwen/Qwen1.4/Qwen-main/finetune/finetune.py", line 367, in train
    trainer.train()
  File "/home/cajr/miniconda3/envs/llm/lib/python3.10/site-packages/transformers/trainer.py", line 1555, in train
    return inner_training_loop(
  File "/home/cajr/miniconda3/envs/llm/lib/python3.10/site-packages/transformers/trainer.py", line 1916, in _inner_training_loop
    self.optimizer.step()
  File "/home/cajr/miniconda3/envs/llm/lib/python3.10/site-packages/accelerate/optimizer.py", line 145, in step
    self.optimizer.step(closure)
  File "/home/cajr/miniconda3/envs/llm/lib/python3.10/site-packages/torch/optim/lr_scheduler.py", line 69, in wrapper
    return wrapped(*args, **kwargs)
  File "/home/cajr/miniconda3/envs/llm/lib/python3.10/site-packages/torch/optim/optimizer.py", line 280, in wrapper
    out = func(*args, **kwargs)
  File "/home/cajr/miniconda3/envs/llm/lib/python3.10/site-packages/torch/optim/optimizer.py", line 33, in _use_grad
    ret = func(self, *args, **kwargs)
  File "/home/cajr/miniconda3/envs/llm/lib/python3.10/site-packages/torch/optim/adamw.py", line 171, in step
    adamw(
  File "/home/cajr/miniconda3/envs/llm/lib/python3.10/site-packages/torch/optim/adamw.py", line 321, in adamw
    func(
  File "/home/cajr/miniconda3/envs/llm/lib/python3.10/site-packages/torch/optim/adamw.py", line 566, in _multi_tensor_adamw
    denom = torch._foreach_add(exp_avg_sq_sqrt, eps)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 MiB (GPU 0; 11.76 GiB total capacity; 10.36 GiB already allocated; 22.06 MiB free; 10.45 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
  0%|
```
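
The allocator hint at the end of the traceback refers to PyTorch's PYTORCH_CUDA_ALLOC_CONF environment variable. A minimal sketch of trying it before re-launching the script follows; note that it only mitigates fragmentation and does not add capacity, so it cannot fix a model that simply does not fit in 12GB (as the reply below explains).

```bash
# Minimal sketch of the allocator tweak suggested in the error message.
# The 128 MiB split size is an arbitrary example value, not a recommendation from the repo.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
bash finetune/finetune_lora_single_gpu.sh
```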

Expected Behavior

Training should run through normally.

Steps To Reproduce

Run `bash finetune_lora_single_gpu.sh`; it fails immediately with the OOM error above.

Environment

- OS: Linux version 5.4.0-152-generic (buildd@lcy02-amd64-051) (gcc version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)) #169~18.04.1-Ubuntu SMP Wed Jun 7 22:22:24 UTC 2023
- Python: 3.10.11
- Transformers: 4.32.0
- PyTorch: 2.0.1+cu117
- CUDA (`python -c 'import torch; print(torch.version.cuda)'`):

Anything else?

No response

jklj077 commented 8 months ago
  1. There is no 1.4B model.
  2. If you mean the 1.8B model, see the README: LoRA fine-tuning of the base model requires LoRA (emb), and 12GB is not enough for that; fine-tuning the Chat model works (see the sketch below), but your model path does not contain "chat".
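A minimal sketch of pointing the same script at the 1.8B Chat weights, using the `-m`/`-d` flags shown in the pasted script; the local path here is only an assumption, and the Hugging Face id Qwen/Qwen-1_8B-Chat should work in its place if the weights are downloaded on the fly:

```bash
# Sketch only: /home/Qwen/Qwen-1_8B-Chat is a placeholder for wherever the Chat weights live.
bash finetune/finetune_lora_single_gpu.sh \
    -m /home/Qwen/Qwen-1_8B-Chat \
    -d /home/LLM_poc/output.json
```

The memory table in the README remains the authoritative reference for which model/method combinations fit in 12GB.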
lizhili commented 8 months ago

Many thanks. After switching to 1.8B-Chat, both lora and lora-single run through. I did notice one thing: compared with lora-single, lora not only occupies memory on both cards, it also uses more memory per card. lora: (7019MiB, 6737MiB) vs. lora-single: (6505MiB, 10MiB). Is this normal?

jklj077 commented 8 months ago

The difference is not large, so this is most likely normal. The configured batch size is per GPU, so with two cards the effective batch size doubles. The 10MiB on the other card in the lora-single run has nothing to do with finetuning; that is Xorg's usage.
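
To make the per-GPU batch size arithmetic concrete, a small sketch using the values from the finetune.py invocation pasted above (per-device batch size 1, gradient accumulation 8):

```bash
# Effective (global) batch size = per-device batch size * number of GPUs * gradient accumulation steps.
# Values taken from the script pasted in the issue above.
PER_DEVICE_BS=1
GRAD_ACCUM=8
for NUM_GPUS in 1 2; do   # 1 = lora_single_gpu, 2 = lora on both cards
    echo "${NUM_GPUS} GPU(s): effective batch size = $((PER_DEVICE_BS * NUM_GPUS * GRAD_ACCUM))"
done
# -> 1 GPU(s): effective batch size = 8
# -> 2 GPU(s): effective batch size = 16
```

So the two-card run processes 16 samples per optimizer step instead of 8; the runs are only strictly comparable if per_device_train_batch_size or gradient_accumulation_steps is halved for the two-card case.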