QwenLM / Qwen

The official repo of Qwen (通义千问) chat & pretrained large language model proposed by Alibaba Cloud.

[BUG] OOM when LoRA fine-tuning a 1.4B model on a single 3090 12G #1002

Closed. lizhili closed this issue 8 months ago.

lizhili commented 8 months ago

Is there an existing issue / discussion for this?

Is there an existing answer for this in the FAQ?

Current Behavior

Running finetune_lora_single_gpu.sh fails with the error below. The script was pulled on Jan 23; its contents are:

```bash
export CUDA_DEVICE_MAX_CONNECTIONS=1

MODEL="/home/Qwen/Qwen1.4/model" # Set the path if you do not want to load from huggingface directly

# ATTENTION: specify the path to your training data, which should be a json file consisting of a list of conversations.
# See the section for finetuning in README for more information.
DATA="/home/LLM_poc/output.json"

function usage() {
    echo '
Usage: bash finetune/finetune_lora_single_gpu.sh [-m MODEL_PATH] [-d DATA_PATH]
'
}

while [[ "$1" != "" ]]; do
    case $1 in
        -m | --model )
            shift
            MODEL=$1
            ;;
        -d | --data )
            shift
            DATA=$1
            ;;
        -h | --help )
            usage
            exit 0
            ;;
        # (the remainder of the argument-parsing loop was cut off in the original paste)

export CUDA_VISIBLE_DEVICES=0

python finetune.py \
  --model_name_or_path $MODEL \
  --data_path $DATA \
  --bf16 True \
  --output_dir output_qwen \
  --num_train_epochs 5 \
  --per_device_train_batch_size 1 \
  --per_device_eval_batch_size 1 \
  --gradient_accumulation_steps 8 \
  --evaluation_strategy "no" \
  --save_strategy "steps" \
  --save_steps 1000 \
  --save_total_limit 10 \
  --learning_rate 3e-4 \
  --weight_decay 0.1 \
  --adam_beta2 0.95 \
  --warmup_ratio 0.01 \
  --lr_scheduler_type "cosine" \
  --logging_steps 1 \
  --report_to "none" \
  --model_max_length 128 \
  --lazy_preprocess True \
  --gradient_checkpointing \
  --use_lora
```

GPU status:

```
Wed Jan 24 10:20:49 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.116.04    Driver Version: 525.116.04    CUDA Version: 12.0   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:2D:00.0 Off |                  N/A |
| 30%   42C    P0   120W / 350W |    296MiB / 12288MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:99:00.0 Off |                  N/A |
| 30%   41C    P0   122W / 350W |     10MiB / 12288MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      5165      G   /usr/lib/xorg/Xorg                 18MiB |
|    0   N/A  N/A      5200      G   /usr/bin/gnome-shell               70MiB |
|    0   N/A  N/A      7879      G   ...on=20240118-080138.585000       31MiB |
|    0   N/A  N/A     17079      G   /usr/lib/xorg/Xorg                110MiB |
|    0   N/A  N/A     17207      G   /usr/bin/gnome-shell               62MiB |
|    1   N/A  N/A      5165      G   /usr/lib/xorg/Xorg                  4MiB |
|    1   N/A  N/A     17079      G   /usr/lib/xorg/Xorg                  4MiB |
+-----------------------------------------------------------------------------+
```

Error:

```
[2024-01-24 10:22:49,095] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
The model is automatically converting to bf16 for faster inference. If you want to disable the automatic precision, please manually add bf16/fp16/fp32=True to "AutoModelForCausalLM.from_pretrained".
Try importing flash-attention for faster inference...
Warning: import flash_attn rotary fail, please install FlashAttention rotary to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/rotary
Warning: import flash_attn rms_norm fail, please install FlashAttention layer_norm to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/layer_norm
Warning: import flash_attn fail, please install FlashAttention to get higher efficiency https://github.com/Dao-AILab/flash-attention
Loading checkpoint shards: 100%|██████████| 2/2 [00:00<00:00,  8.22it/s]
trainable params: 676,003,840 || all params: 2,512,832,512 || trainable%: 26.902065170350678
Loading data...
Formatting inputs...Skip in lazy mode
Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
  0%|          | 0/30510 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/cajr/lizl/Qwen/Qwen1.4/Qwen-main/finetune/finetune.py", line 374, in <module>
    train()
  File "/home/cajr/lizl/Qwen/Qwen1.4/Qwen-main/finetune/finetune.py", line 367, in train
    trainer.train()
  File "/home/cajr/miniconda3/envs/llm/lib/python3.10/site-packages/transformers/trainer.py", line 1555, in train
    return inner_training_loop(
  File "/home/cajr/miniconda3/envs/llm/lib/python3.10/site-packages/transformers/trainer.py", line 1916, in _inner_training_loop
    self.optimizer.step()
  File "/home/cajr/miniconda3/envs/llm/lib/python3.10/site-packages/accelerate/optimizer.py", line 145, in step
    self.optimizer.step(closure)
  File "/home/cajr/miniconda3/envs/llm/lib/python3.10/site-packages/torch/optim/lr_scheduler.py", line 69, in wrapper
    return wrapped(*args, **kwargs)
  File "/home/cajr/miniconda3/envs/llm/lib/python3.10/site-packages/torch/optim/optimizer.py", line 280, in wrapper
    out = func(*args, **kwargs)
  File "/home/cajr/miniconda3/envs/llm/lib/python3.10/site-packages/torch/optim/optimizer.py", line 33, in _use_grad
    ret = func(self, *args, **kwargs)
  File "/home/cajr/miniconda3/envs/llm/lib/python3.10/site-packages/torch/optim/adamw.py", line 171, in step
    adamw(
  File "/home/cajr/miniconda3/envs/llm/lib/python3.10/site-packages/torch/optim/adamw.py", line 321, in adamw
    func(
  File "/home/cajr/miniconda3/envs/llm/lib/python3.10/site-packages/torch/optim/adamw.py", line 566, in _multi_tensor_adamw
    denom = torch._foreach_add(exp_avg_sq_sqrt, eps)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 MiB (GPU 0; 11.76 GiB total capacity; 10.36 GiB already allocated; 22.06 MiB free; 10.45 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
  0%|
```
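
The allocator hint at the end of the traceback refers to PyTorch's PYTORCH_CUDA_ALLOC_CONF environment variable. A minimal sketch of trying it before re-launching the script follows; note that it only mitigates fragmentation and does not add capacity, so it cannot fix a model that simply does not fit in 12GB (as the reply below explains).

```bash
# Minimal sketch of the allocator tweak suggested in the error message.
# The 128 MiB split size is an arbitrary example value, not a recommendation from the repo.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
bash finetune/finetune_lora_single_gpu.sh
```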

Expected Behavior

Training should run through normally.

Steps To Reproduce

Run `bash finetune_lora_single_gpu.sh`; it fails immediately with the OOM error above.

Environment

- OS: Linux version 5.4.0-152-generic (buildd@lcy02-amd64-051) (gcc version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)) #169~18.04.1-Ubuntu SMP Wed Jun 7 22:22:24 UTC 2023
- Python: 3.10.11
- Transformers: 4.32.0
- PyTorch: 2.0.1+cu117
- CUDA (`python -c 'import torch; print(torch.version.cuda)'`):

Anything else?

No response

jklj077 commented 8 months ago
  1. There is no 1.4B model.
  2. If you mean the 1.8B model, see the README: LoRA fine-tuning of the base model requires LoRA (emb), and 12GB is not enough for that; fine-tuning the Chat model works (see the sketch below), but your model path does not contain "chat".
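A minimal sketch of pointing the same script at the 1.8B Chat weights, using the `-m`/`-d` flags shown in the pasted script; the local path here is only an assumption, and the Hugging Face id Qwen/Qwen-1_8B-Chat should work in its place if the weights are downloaded on the fly:

```bash
# Sketch only: /home/Qwen/Qwen-1_8B-Chat is a placeholder for wherever the Chat weights live.
bash finetune/finetune_lora_single_gpu.sh \
    -m /home/Qwen/Qwen-1_8B-Chat \
    -d /home/LLM_poc/output.json
```

The memory table in the README remains the authoritative reference for which model/method combinations fit in 12GB.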
lizhili commented 8 months ago

Many thanks. After switching to 1.8B-Chat, both lora and lora-single run through. I did notice one thing: compared with lora-single, lora not only occupies memory on both cards, it also uses more memory per card. lora: (7019MiB, 6737MiB) vs. lora-single: (6505MiB, 10MiB). Is this normal?

jklj077 commented 8 months ago

The difference is not large, so this is most likely normal. The configured batch size is per GPU, so with two cards the effective batch size doubles. The 10MiB on the other card in the lora-single run has nothing to do with finetuning; that is Xorg's usage.
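
To make the per-GPU batch size arithmetic concrete, a small sketch using the values from the finetune.py invocation pasted above (per-device batch size 1, gradient accumulation 8):

```bash
# Effective (global) batch size = per-device batch size * number of GPUs * gradient accumulation steps.
# Values taken from the script pasted in the issue above.
PER_DEVICE_BS=1
GRAD_ACCUM=8
for NUM_GPUS in 1 2; do   # 1 = lora_single_gpu, 2 = lora on both cards
    echo "${NUM_GPUS} GPU(s): effective batch size = $((PER_DEVICE_BS * NUM_GPUS * GRAD_ACCUM))"
done
# -> 1 GPU(s): effective batch size = 8
# -> 2 GPU(s): effective batch size = 16
```

So the two-card run processes 16 samples per optimizer step instead of 8; the runs are only strictly comparable if per_device_train_batch_size or gradient_accumulation_steps is halved for the two-card case.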