I have customized SFT and evaluation scripts using QLoRA, but I run into a GPU out-of-memory error during the evaluation steps. Has anyone hit the same issue, or have any insights on how to reduce memory usage in the eval steps?
The trainer and dataset setup look like this:
#######################################################################
gradient_accumulation_steps = 4
per_device_train_batch_size = 4
per_device_eval_batch_size = 1
total_train_samples = len(train_data)
total_validation_samples = len(validation_data)
print(" Total training samples:", total_train_samples)
print(" Total validation samples:", total_validation_samples)
num_train_steps_per_epoch = (total_train_samples // per_device_train_batch_size // gradient_accumulation_steps)
print('* num_train_steps_per_epoch: ', num_train_steps_per_epoch)
num_train_epochs = 1
max_steps = int(num_train_epochs * num_train_steps_per_epoch)
print(' Max steps:', max_steps)
# trainer
trainer = transformers.Trainer(
    model=model,
    train_dataset=train_data,
    eval_dataset=validation_data,
    compute_metrics=compute_bleu_score,
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
    args=transformers.TrainingArguments(
        per_device_train_batch_size=per_device_train_batch_size,
        per_device_eval_batch_size=per_device_eval_batch_size,
        gradient_accumulation_steps=gradient_accumulation_steps,
        warmup_steps=2,
        max_steps=max_steps,
        learning_rate=1e-4,
        evaluation_strategy="steps",
        eval_steps=50,
        save_steps=50,
        logging_steps=10,
        save_total_limit=2,
        fp16=True,
        output_dir="outputs",
        optim="paged_adamw_8bit",
    ),
)
model.config.use_cache = False
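One mitigation I'm considering, based on my reading of the transformers Trainer docs (a sketch, not something I've tested yet): when a compute_metrics function is passed, the Trainer accumulates every eval batch's full logits (batch x seq_len x vocab_size) on GPU until evaluation finishes. Setting eval_accumulation_steps moves the accumulated tensors to CPU periodically, and preprocess_logits_for_metrics can reduce the logits to token IDs before they are stored. The helper function name below is my own; only the changed lines differ from my setup above:

def preprocess_logits_for_metrics(logits, labels):
    # Some models return a tuple (logits, past_key_values, ...); keep only the logits.
    if isinstance(logits, tuple):
        logits = logits[0]
    # Reduce (batch, seq_len, vocab_size) logits to (batch, seq_len) token IDs
    # before the Trainer accumulates them, cutting eval memory by ~vocab_size x.
    return logits.argmax(dim=-1)

trainer = transformers.Trainer(
    model=model,
    train_dataset=train_data,
    eval_dataset=validation_data,
    compute_metrics=compute_bleu_score,
    preprocess_logits_for_metrics=preprocess_logits_for_metrics,
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
    args=transformers.TrainingArguments(
        per_device_train_batch_size=per_device_train_batch_size,
        per_device_eval_batch_size=per_device_eval_batch_size,
        gradient_accumulation_steps=gradient_accumulation_steps,
        eval_accumulation_steps=1,  # offload accumulated eval tensors to CPU every step
        warmup_steps=2,
        max_steps=max_steps,
        learning_rate=1e-4,
        evaluation_strategy="steps",
        eval_steps=50,
        save_steps=50,
        logging_steps=10,
        save_total_limit=2,
        fp16=True,
        output_dir="outputs",
        optim="paged_adamw_8bit",
    ),
)

With this change, compute_bleu_score would receive token IDs rather than raw logits as predictions, so it would need to skip any argmax of its own. Does this look like the right approach?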