OptimalScale / LMFlow

An Extensible Toolkit for Finetuning and Inference of Large Foundation Models. Large Models for All.
https://optimalscale.github.io/LMFlow/
Apache License 2.0

Weird Loss Curve #831

Open · Zihang-Xu-2002 opened this issue 1 month ago

Zihang-Xu-2002 commented 1 month ago

I trained Llama-3 on my own conversation dataset with the command:

```bash
./scripts/run_finetune.sh \
  --model_name_or_path meta-llama/Meta-Llama-3-8B \
  --dataset_path data/alpaca_selected/train \
  --conversation_template llama3 \
  --output_model_path output_models/finetuned_llama3_8b_selected
```

The initial learning rate is 2e-5 and the per-device batch size is 4. I found that the loss drops sharply at the beginning of every epoch, but within an epoch there is no obvious decrease.

[screenshot: training loss curve for Llama-3]

Before this I trained Llama-2 with:

```bash
./scripts/run_finetune.sh \
  --model_name_or_path meta-llama/Llama-2-7b-hf \
  --dataset_path data/alpaca_raw/train \
  --conversation_template llama2 \
  --output_model_path output_models/finetuned_llama2_7b_raw
```

The initial learning rate is 8e-6 and the per-device batch size is 4. The loss curve looks like this:

[screenshot: training loss curve for Llama-2]

I am not sure whether gradient accumulation causes this. I set "gradient_accumulation_steps" in configs/ds_config_zero3.json to 1, but nothing changed.
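Concretely, the edit in configs/ds_config_zero3.json is just this key (only the edited fragment is shown as a sketch; the rest of the ZeRO-3 config is unchanged):

```json
{
  "gradient_accumulation_steps": 1
}
```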

[screenshot: training loss curve after setting gradient_accumulation_steps to 1]

Could you help me with this issue? Thank you for your time and attention.

research4pan commented 1 month ago

Thanks for your interest in LMFlow! We've observed similar loss curves in some of our experiments. After careful examination, we attributed this to overfitting of the instruction-following dataset on Llama models. The flat loss curve within each epoch may come from the large variance of the dataset; decreasing the learning rate or increasing the batch size should help, though the overall tendency should remain the same.
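For example, a re-run with a lower learning rate and a larger per-device batch size could look roughly like the sketch below. The flag names `--learning_rate` and `--per_device_train_batch_size` come from the underlying HuggingFace `TrainingArguments`; whether `run_finetune.sh` forwards them unchanged depends on your LMFlow version, so please check the script first.

```bash
# Sketch only: flag names follow HuggingFace TrainingArguments;
# confirm that run_finetune.sh forwards them in your LMFlow version.
./scripts/run_finetune.sh \
  --model_name_or_path meta-llama/Meta-Llama-3-8B \
  --dataset_path data/alpaca_selected/train \
  --conversation_template llama3 \
  --output_model_path output_models/finetuned_llama3_8b_lowlr \
  --learning_rate 1e-5 \
  --per_device_train_batch_size 8
```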

You may also check your evaluation/test results; if they look normal, then it may not be a serious issue 😄
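For a quick sanity check, the repository's evaluation script can be run on a held-out split. The invocation below is only a sketch: the script name, the `--metric` value, and the test dataset path are assumptions, so please check the `scripts/` directory of your LMFlow version for the exact interface.

```bash
# Sketch only: script name, flags, and dataset path are assumptions;
# see the scripts/ directory in the LMFlow repo for the exact interface.
./scripts/run_evaluation.sh \
  --model_name_or_path output_models/finetuned_llama3_8b_selected \
  --dataset_path data/alpaca_selected/test \
  --metric accuracy
```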