artidoro / qlora

QLoRA: Efficient Finetuning of Quantized LLMs
https://arxiv.org/abs/2305.14314
MIT License

Question: CUDA memory usage in the evaluation phase #261

Open LimboWK opened 1 year ago

LimboWK commented 1 year ago

I have customized SFT and evaluation scripts using QLoRA, but I run into a GPU out-of-memory error during the evaluation steps. Does anyone have the same issue, or any insights on how to reduce memory usage during eval?

The trainer and dataset setup looks like this:

```python
import transformers

#######################################################################
gradient_accumulation_steps = 4
per_device_train_batch_size = 4
per_device_eval_batch_size = 1

total_train_samples = len(train_data)
total_validation_samples = len(validation_data)
print("Total training samples:", total_train_samples)
print("Total validation samples:", total_validation_samples)

num_train_steps_per_epoch = (total_train_samples // per_device_train_batch_size // gradient_accumulation_steps)
print("num_train_steps_per_epoch:", num_train_steps_per_epoch)
num_train_epochs = 1
max_steps = int(num_train_epochs * num_train_steps_per_epoch)
print("Max steps:", max_steps)

# trainer
trainer = transformers.Trainer(
    model=model,
    train_dataset=train_data,
    eval_dataset=validation_data,
    compute_metrics=compute_bleu_score,
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
    args=transformers.TrainingArguments(
        per_device_train_batch_size=per_device_train_batch_size,
        per_device_eval_batch_size=per_device_eval_batch_size,
        gradient_accumulation_steps=gradient_accumulation_steps,
        warmup_steps=2,
        max_steps=max_steps,
        learning_rate=1e-4,
        evaluation_strategy="steps",
        eval_steps=50,
        save_steps=50,
        logging_steps=10,
        save_total_limit=2,
        fp16=True,
        output_dir="outputs",
        optim="paged_adamw_8bit",
    ),
)
model.config.use_cache = False
```

jonataslaw commented 1 year ago

per_device_train_batch_size = 1
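
One thing worth noting: when `compute_metrics` is set, the Trainer accumulates the predicted logits (batch × seq_len × vocab_size) across the whole eval set, and that accumulation, rather than the eval batch size, is often what runs out of CUDA memory. A minimal sketch, not from this repo, of the two stock Hugging Face knobs that address this (`eval_accumulation_steps` and `preprocess_logits_for_metrics`); the values are only illustrative, the variable names (`model`, `train_data`, `validation_data`, `tokenizer`, `compute_bleu_score`) are taken from the snippet above, and `compute_bleu_score` would then receive token ids instead of raw logits:

```python
import transformers

def preprocess_logits_for_metrics(logits, labels):
    # Keep only the predicted token ids instead of the full
    # (batch, seq_len, vocab_size) logit tensor, so far less memory
    # is accumulated during evaluation.
    if isinstance(logits, tuple):
        logits = logits[0]
    return logits.argmax(dim=-1)

trainer = transformers.Trainer(
    model=model,
    train_dataset=train_data,
    eval_dataset=validation_data,
    compute_metrics=compute_bleu_score,
    preprocess_logits_for_metrics=preprocess_logits_for_metrics,
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
    args=transformers.TrainingArguments(
        per_device_train_batch_size=1,      # as suggested above
        per_device_eval_batch_size=1,
        eval_accumulation_steps=4,          # move accumulated predictions to CPU every 4 eval steps
        gradient_accumulation_steps=4,
        evaluation_strategy="steps",
        eval_steps=50,
        fp16=True,
        output_dir="outputs",
        optim="paged_adamw_8bit",
    ),
)
```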

ChenMnZ commented 9 months ago

I also encountered this problem. Did you solve it later?