THUDM / GLM-4

GLM-4 series: Open Multilingual Multimodal Chat LMs | 开源多语言多模态对话模型
Apache License 2.0

Error during LoRA fine-tuning on 4x 3090: OutOfMemoryError: CUDA out of memory. Tried to allocate 214.00 MiB. GPU #375

Closed: ShepherdX closed this issue 1 month ago

ShepherdX commented 1 month ago

System Info / 系統信息

CUDA Version: 12.2, transformers Version: 4.42.2, Python 3.10.12
batch size is set to 1, max_input_length: 4096, max_output_length: 2048

OOM occurs when fine-tuning with LoRA.

The configuration file is as follows:

data_config:
  train_file: train.jsonl
  val_file: test.jsonl
  test_file: test.jsonl
  num_proc: 1
max_input_length: 4096
max_output_length: 2048
training_args:
  # see `transformers.Seq2SeqTrainingArguments`
  output_dir: ./output
  num_train_epochs: 1
  # needed to be fit for the dataset
  learning_rate: 5e-4
  # settings for data loading
  per_device_train_batch_size: 1
  gradient_accumulation_steps: 8
  dataloader_num_workers: 8
  remove_unused_columns: false
  # settings for saving checkpoints
  save_strategy: steps
  save_steps: 10
  save_total_limit: 10
  # settings for logging
  log_level: info
  logging_strategy: steps
  logging_steps: 1
  # settings for evaluation
  per_device_eval_batch_size: 1
  eval_strategy: steps
  eval_steps: 10
  # settings for optimizer
  adam_epsilon: 1e-6
  warmup_ratio: 0.01
  # uncomment the following line to detect nan or inf values
  # debug: underflow_overflow
  predict_with_generate: true
  # see `transformers.GenerationConfig`
  generation_config:
    max_new_tokens: 2048
  # set your absolute deepspeed path here
  deepspeed: configs/ds_zero_3.json
peft_config:
  peft_type: LORA
  task_type: CAUSAL_LM
  r: 8
  lora_alpha: 32
  lora_dropout: 0.1
  target_modules: ["query_key_value"]
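
For reference, a minimal sketch of adjustments that usually lower peak memory with a lora.yaml laid out like the one above, assuming training_args is passed straight to transformers.Seq2SeqTrainingArguments as the in-file comment indicates (gradient_checkpointing is a standard Seq2SeqTrainingArguments flag; the other keys already appear above):

max_input_length: 2048           # shorter sequences cut activation memory roughly in proportion
max_output_length: 1024
training_args:
  per_device_train_batch_size: 1
  gradient_accumulation_steps: 8
  gradient_checkpointing: true   # trades extra compute for much lower activation memory
  predict_with_generate: false   # the 2048-token generation pass during eval is itself memory-hungry
  generation_config:
    max_new_tokens: 512          # only relevant if predict_with_generate stays on
  deepspeed: configs/ds_zero_3.json

If that is still not enough, ZeRO-3 CPU offload (offload_param / offload_optimizer in configs/ds_zero_3.json) is DeepSpeed's usual next knob, at the cost of training speed.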

Error message:

[rank0]: │ tition_parameters.py:1235 in all_gather_coalesced                                                │
[rank0]: │                                                                                                  │
[rank0]: │   1232 │   │   │   │   │   buffer_size = param.ds_secondary_tensor.shape[0] * world_size  #make  │
[rank0]: │   1233 │   │   │   │                                                                             │
[rank0]: │   1234 │   │   │   │   param_ds_tensor = param.ds_secondary_tensor if use_secondary_tensor else  │
[rank0]: │ ❱ 1235 │   │   │   │   param_buffer = torch.empty(                                               │
[rank0]: │   1236 │   │   │   │   │   buffer_size,                                                          │
[rank0]: │   1237 │   │   │   │   │   dtype=param_ds_tensor.dtype if not quantize else torch.int8,          │
[rank0]: │   1238 │   │   │   │   │   device=get_accelerator().current_device_name(),                       │
[rank0]: ╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
[rank0]: OutOfMemoryError: CUDA out of memory. Tried to allocate 214.00 MiB. GPU
  0%|          | 0/3 [00:07<?, ?it/s]
W0722 11:38:33.647000 140524727420736 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 89102 closing signal SIGTERM
W0722 11:38:33.648000 140524727420736 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 89103 closing signal SIGTERM
W0722 11:38:33.649000 140524727420736 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 89104 closing signal SIGTERM
E0722 11:38:34.130000 140524727420736 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 89101) of binary: 
Traceback (most recent call last):

Who can help? / 谁可以帮助到您?

No response

Information / 问题信息

Reproduction / 复现过程

  1. Prepare the training data
  2. Launch the fine-tuning job: OMP_NUM_THREADS=1 torchrun --standalone --nnodes=1 --nproc_per_node=4 finetune.py

Expected behavior / 期待表现

The fine-tuning job completes successfully.

zRzRzRzRzRzRzR commented 1 month ago

Perhaps you should take a look at the configuration required for SFT in the README.

ShepherdX commented 1 month ago

Do you mean that after max_length is increased to 4096, an A100 is required to do LoRA fine-tuning?

ShepherdX commented 1 month ago

Perhaps you should take a look at the configuration required for SFT in the README.

I'd like to ask which part this GPU memory usage mainly comes from, because with the same data, fine-tuning llama3-8B supports a maximum length of up to 8k.

morettt commented 1 month ago

You probably should take a look at the configuration required for SFT in the README.

The training config file he's using is lora.yaml, right?

feihuamantian commented 1 month ago

https://github.com/OpenGVLab/InternVL/issues/351

youkuxiaobin commented 1 month ago

Hitting the same problem: CUDA out of memory even with multiple GPUs.

elesun2018 commented 2 weeks ago

Using the GLM-4 code as of 0731 (git), with per_device_train_batch_size: 1, writer_batch_size=1, batch_size=1. Running CUDA_VISIBLE_DEVICES=1 python finetune_vision.py from GLM-4/finetune_demo reports:

OutOfMemoryError: CUDA out of memory. Tried to allocate 1.22 GiB (GPU 0; 47.54 GiB total capacity; 44.83 GiB already allocated; 1.07 GiB free; 46.13 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

elesun2018 commented 2 weeks ago

For the OutOfMemoryError: CUDA out of memory issue, do I need to update the code or update the model files?

elesun2018 commented 2 weeks ago

Updating the code and the model files on 0821 didn't help; I still get OutOfMemoryError. What is the minimum GPU memory required for fine-tuning? How should it be configured? And where else should this be investigated (environment versions? the code where the memory blows up)?