THUDM / GLM-4

GLM-4 series: Open Multilingual Multimodal Chat LMs | 开源多语言多模态对话模型
Apache License 2.0

OOM during LoRA fine-tuning on a single RTX 3090 Ti #228

Closed RyanCcc114 closed 4 days ago

RyanCcc114 commented 1 week ago

System Info / 系統信息

torch 2.1.0; hardware: a single RTX 3090 Ti

lora.yaml:

```yaml
training_args:
  # see `transformers.Seq2SeqTrainingArguments`
  output_dir: ./output
  max_steps: 27000
  # needed to be fit for the dataset
  learning_rate: 5e-4
  # settings for data loading
  per_device_train_batch_size: 1
  dataloader_num_workers: 16
  remove_unused_columns: false
  # settings for saving checkpoints
  save_strategy: steps
  save_steps: 500
  # settings for logging
  log_level: info
  logging_strategy: steps
  logging_steps: 10
  # settings for evaluation
  per_device_eval_batch_size: 2
  evaluation_strategy: steps
  eval_steps: 500
  # settings for optimizer
  # adam_epsilon: 1e-6
  # uncomment the following line to detect nan or inf values
  # debug: underflow_overflow
  predict_with_generate: true
  # see `transformers.GenerationConfig`
  generation_config:
    max_new_tokens: 512
  # set your absolute deepspeed path here
  deepspeed: ds_zero_2.json
peft_config:
  peft_type: LORA
  task_type: CAUSAL_LM
  r: 8
  lora_alpha: 32
  lora_dropout: 0.1
```

Who can help? / 谁可以帮助到您?

No response

Information / 问题信息

Reproduction / 复现过程

When training on a dataset of 9,000 samples, an out-of-memory error occurs:

```
OutOfMemoryError: CUDA out of memory. Tried to allocate 12.00 MiB. GPU 0 has a total capacty of 23.99 GiB of which 0 bytes is free. Including non-PyTorch memory, this process has 17179869184.00 GiB memory in use. Of the allocated memory 23.01 GiB is allocated by PyTorch, and 195.10 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```
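As the error message itself suggests, allocator fragmentation can sometimes be reduced by setting `max_split_size_mb` through `PYTORCH_CUDA_ALLOC_CONF`. A minimal sketch of how to set it (the value 128 is only an illustrative choice, and it must be set before the first CUDA allocation):

```python
import os

# Illustrative value; tune per workload. Must be set before CUDA is initialized,
# so do it before importing torch (or export it in the shell before launching).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # noqa: E402

print(torch.cuda.is_available())
```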

Expected behavior / 期待表现

Fine-tuning should run normally.

zRzRzRzRzRzRzR commented 6 days ago

Can you run the job directly without DeepSpeed (non-ds), or does it error out either way?

RyanCcc114 commented 6 days ago

> Can you run the job directly without DeepSpeed (non-ds), or does it error out either way?

The non-DeepSpeed job also errors out; I am fine-tuning inside WSL. The strange thing is that the official fine-tuning script runs out of GPU memory, but fine-tuning in LLaMA-Factory does not.

zRzRzRzRzRzRzR commented 6 days ago

Have you updated to the latest fine-tuning code? The old code could indeed run out of GPU memory. I have not tested WSL; I develop purely on Linux.

RyanCcc114 commented 6 days ago

After updating to the latest fine-tuning code, the loss stays at 0 from the start of training.

zRzRzRzRzRzRzR commented 5 days ago

Make sure you are fine-tuning with BF16 precision.

Yang-125 commented 5 days ago

> Make sure you are fine-tuning with BF16 precision.

Where exactly do I set BF16 precision for fine-tuning?

RyanCcc114 commented 5 days ago

> Make sure you are fine-tuning with BF16 precision.

The bf16 field has already been set to true in the LoRA config file.

lora.yaml:

```yaml
data_config:
  train_file: train.jsonl
  val_file: dev.jsonl
  test_file: dev.jsonl
  num_proc: 1
max_input_length: 512
max_output_length: 512
training_args:
  # see `transformers.Seq2SeqTrainingArguments`
  output_dir: ./output
  max_steps: 3000
  # needed to be fit for the dataset
  bf16: true
  learning_rate: 5e-4
  # settings for data loading
  per_device_train_batch_size: 1
  dataloader_num_workers: 16
  remove_unused_columns: false
  # settings for saving checkpoints
  save_strategy: steps
  save_steps: 500
  # settings for logging
  log_level: info
  logging_strategy: steps
  logging_steps: 10
  # settings for evaluation
  per_device_eval_batch_size: 4
  evaluation_strategy: steps
  eval_steps: 500
  # settings for optimizer
  # adam_epsilon: 1e-6
  # uncomment the following line to detect nan or inf values
  # debug: underflow_overflow
  predict_with_generate: true
  # see `transformers.GenerationConfig`
  generation_config:
    max_new_tokens: 512
  # set your absolute deepspeed path here
  # deepspeed: ds_zero_2.json
peft_config:
  peft_type: LORA
  task_type: CAUSAL_LM
  r: 8
  lora_alpha: 32
  lora_dropout: 0.1
```

Yang-125 commented 5 days ago

> > Make sure you are fine-tuning with BF16 precision.
>
> The bf16 field has already been set to true in the LoRA config file (the same `lora.yaml` as quoted above).

Here is my lora.yaml:

```yaml
data_config:
  train_file: train.jsonl
  val_file: dev.jsonl
  test_file: dev.jsonl
  num_proc: 1
max_input_length: 512
max_output_length: 512
training_args:
  # see `transformers.Seq2SeqTrainingArguments`
  output_dir: ./output
  max_steps: 3000
  # needed to be fit for the dataset
  bf16: true
  learning_rate: 5e-4
  # settings for data loading
  per_device_train_batch_size: 1
  dataloader_num_workers: 16
  remove_unused_columns: false
  # settings for saving checkpoints
  save_strategy: steps
  save_steps: 500
  # settings for logging
  log_level: info
  logging_strategy: steps
  logging_steps: 10
  # settings for evaluation
  per_device_eval_batch_size: 4
  evaluation_strategy: steps
  eval_steps: 500
  # settings for optimizer
  # adam_epsilon: 1e-6
  # uncomment the following line to detect nan or inf values
  # debug: underflow_overflow
  predict_with_generate: true
  # see `transformers.GenerationConfig`
  generation_config:
    max_new_tokens: 512
  # set your absolute deepspeed path here
  deepspeed: /home/yqx/workspace/Compared/GLM-4/finetune_demo/configs/ds_zero_2.json
peft_config:
  peft_type: LORA
  task_type: CAUSAL_LM
  r: 8
  lora_alpha: 32
  lora_dropout: 0.1
```

Your deepspeed line is commented out; does it need to be enabled? Also, after I enabled it on my side, the loss still stays at 0.0.

zRzRzRzRzRzRzR commented 5 days ago

Yes, it needs to be enabled. Could you share a screenshot of the dataset-loading output?

Yang-125 commented 5 days ago

> Yes, it needs to be enabled. Could you share a screenshot of the dataset-loading output?

Sure! (screenshot `截屏2024-06-26 15.41.38.png` did not finish uploading)

RyanCcc114 commented 5 days ago

With the same fine-tuning config file, the old version of the fine-tuning code runs out of GPU memory, while the new version trains with a loss of 0. (screenshot: Snipaste_2024-06-26_15-39-31)

zRzRzRzRzRzRzR commented 4 days ago

Is your dataset content being parsed correctly? Before starting fine-tuning, I suggest you check the label part of each sample after the chat template has been applied.
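One rough way to do that check, as a sketch (the model id and the sample message below are placeholders, not taken from this thread): decode what `apply_chat_template` produces and make sure the assistant turn, which is the part that receives real labels, actually appears.

```python
from transformers import AutoTokenizer

# Placeholder model id; use the checkpoint you are actually fine-tuning.
tokenizer = AutoTokenizer.from_pretrained("THUDM/glm-4-9b-chat", trust_remote_code=True)

message = [
    {"role": "user", "content": "你好"},
    {"role": "assistant", "content": "你好,请问有什么可以帮助你?"},
]

# Same call shape as in finetune.py's process_batch: a batch of one conversation.
ids = tokenizer.apply_chat_template([message], tokenize=True, return_dict=False)[0]
print(tokenizer.decode(ids))  # the assistant reply should appear in the decoded text
```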

RyanCcc114 commented 4 days ago

Debugging showed the problem is in how input_ids are built. In finetune.py's process_batch function, changing `new_input_ids = tokenizer.apply_chat_template([message], tokenize=True, return_dict=False)[2:]` to `new_input_ids = tokenizer.apply_chat_template([message], tokenize=True, return_dict=False)[0][2:]` makes the input and label load correctly; applying the same change to process_batch_eval also works.
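A minimal sketch of why the extra `[0]` matters (placeholder model id and message; only the two `new_input_ids` lines reflect the change described above). With a batch of one conversation, `apply_chat_template` returns a list containing one token-id list, so `[2:]` slices the outer list and leaves nothing, which would explain the loss staying at 0; `[0][2:]` first selects the conversation's ids and then drops its leading template tokens.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("THUDM/glm-4-9b-chat", trust_remote_code=True)
message = [{"role": "user", "content": "hi"}, {"role": "assistant", "content": "hello"}]

# Old line: slices the outer (length-1) list, so the result is empty and no
# usable input_ids/labels are ever built.
broken = tokenizer.apply_chat_template([message], tokenize=True, return_dict=False)[2:]
print(broken)  # []

# Fixed line: take the first conversation's ids, then drop the first two tokens.
new_input_ids = tokenizer.apply_chat_template([message], tokenize=True, return_dict=False)[0][2:]
print(len(new_input_ids))  # non-zero
```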

RyanCcc114 commented 4 days ago

But after the change, the OOM problem still shows up 😂

> Debugging showed the problem is in how input_ids are built. In finetune.py's process_batch function, changing `new_input_ids = tokenizer.apply_chat_template([message], tokenize=True, return_dict=False)[2:]` to `new_input_ids = tokenizer.apply_chat_template([message], tokenize=True, return_dict=False)[0][2:]` makes the input and label load correctly; applying the same change to process_batch_eval also works.

Yang-125 commented 4 days ago

> But after the change, the OOM problem still shows up 😂
>
> > Debugging showed the problem is in how input_ids are built. In finetune.py's process_batch function, changing `new_input_ids = tokenizer.apply_chat_template([message], tokenize=True, return_dict=False)[2:]` to `new_input_ids = tokenizer.apply_chat_template([message], tokenize=True, return_dict=False)[0][2:]` makes the input and label load correctly; applying the same change to process_batch_eval also works.

Thanks a lot, the data now loads correctly. The OOM was probably caused by the dataset being too large together with the number of training steps. On my side, multi-GPU fine-tuning also errored out because one of the cards did not have enough free memory, but after excluding that card it ran successfully.
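A sketch of one way to exclude a specific card (the GPU indices are placeholders; `CUDA_VISIBLE_DEVICES` must be set before CUDA is initialized, e.g. before importing torch or exported in the shell when launching the training script):

```python
import os

# Placeholder: hide GPU 1 so that only GPUs 0, 2 and 3 are visible to PyTorch.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,2,3"

import torch  # noqa: E402

print(torch.cuda.device_count())  # counts only the visible devices
```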