artidoro / qlora

QLoRA: Efficient Finetuning of Quantized LLMs
https://arxiv.org/abs/2305.14314
MIT License

Diverging evaluation loss using finetuning scripts Guanaco 7b #152

Closed KJ-Waller closed 1 year ago

KJ-Waller commented 1 year ago

Is anyone else having this issue when using the finetune_guanaco_7b.sh script? I keep seeing the evaluation loss diverge rather than converge. I first noticed this while trying to finetune on my own datasets, and after troubleshooting, found that it also happens with the original training scripts provided.

[Screenshots: guanaco-7b evaluation loss and training loss curves]
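For context, what that script does at a high level is finetune a 4-bit quantized base model with LoRA adapters. The following is a minimal sketch of such a setup using transformers, peft, and bitsandbytes; the model name and LoRA hyperparameters are illustrative assumptions, not the script's exact values:

```python
# Minimal sketch of a QLoRA-style setup (4-bit base model + LoRA adapters).
# The model name and LoRA hyperparameters below are assumptions for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "huggyllama/llama-7b"  # assumed base model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NF4 data type from the QLoRA paper
    bnb_4bit_use_double_quant=True,         # double quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # illustrative subset
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapters are trainable
```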

FHL1998 commented 1 year ago

Same issue here. I wonder if it is caused by the token_id? I tried "self-instruct" and "alpaca", but still get the same problem; the performance is even worse than with GPT-2.

KJ-Waller commented 1 year ago

I wonder if it is caused by the token_id?

Glad to hear I'm not the only one. But what do you mean exactly by this comment?

FHL1998 commented 1 year ago

I wonder if it is caused by the token_id?

Glad to hear I'm not the only one. But what do you mean exactly by this comment?

Basically, I tried four things to get rid of it, but none of them worked:

  1. trying different dataset formats;
  2. tuning the learning rate;
  3. changing source_max_len and target_max_len;
  4. using a larger model (30B).

Now I suspect the reasons for this issue are:

  1. The length and format of the input: some of my inputs are long (around 500 tokens) while others are only around 200 tokens, and my dataset contains a lot of \n and - characters, so I wonder if these are the causes (a quick way to check the length distribution is sketched below).
  2. Maybe the dataset is not large enough? But it is still strange that my train_loss converges while my eval_loss increases.
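One way to test the truncation suspicion from point 1 is to tokenize the dataset and compare the length distribution against source_max_len and target_max_len. A minimal sketch, assuming a JSON dataset with hypothetical "input"/"output" fields and illustrative length limits:

```python
# Sketch: check how many examples exceed source_max_len / target_max_len.
# The file name, field names, and limits are assumptions; adjust to your data.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")  # assumed base model
dataset = load_dataset("json", data_files="my_data.json", split="train")

source_max_len, target_max_len = 512, 256  # illustrative limits

src_lens = [len(tokenizer(ex["input"]).input_ids) for ex in dataset]
tgt_lens = [len(tokenizer(ex["output"]).input_ids) for ex in dataset]

print(f"source: max={max(src_lens)}, truncated={sum(l > source_max_len for l in src_lens)}")
print(f"target: max={max(tgt_lens)}, truncated={sum(l > target_max_len for l in tgt_lens)}")
```
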
KJ-Waller commented 1 year ago

Interesting. I tried several things too, including some lr tuning, larger models, and different model families like falcon-7b and falcon-40b. I also played a bit with the source_max_len parameter and used different datasets. Then I decided to run the default script without changing anything about the model or dataset, and we still see the training loss converge while the eval loss diverges.

quannguyen268 commented 1 year ago

Same issue. I think maybe the model is too large and the dataset is too small, so the model overfits the training dataset?

griff4692 commented 1 year ago

I think these PEFT models are overfitting very quickly on small-ish datasets. I have the same issue.

OfirArviv commented 1 year ago

I have the same issue with flan-t5-xxl, ul2, and xglm. But I ran the same code without 4-bit, with just LoRA, and the model converged normally. So it is the 4-bit part.

And as far as I can see, the performance on the training set is decreasing as well.
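For reference, the ablation described here amounts to keeping the LoRA adapter fixed and toggling the 4-bit quantization of the base model. A minimal sketch with transformers and peft; the model name and LoRA hyperparameters are illustrative assumptions:

```python
# Sketch of the ablation: same LoRA adapter, with and without 4-bit quantization.
# Model name and LoRA hyperparameters are illustrative assumptions.
import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForSeq2SeqLM, BitsAndBytesConfig

def load_model(use_4bit: bool):
    kwargs = {"device_map": "auto"}
    if use_4bit:
        kwargs["quantization_config"] = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.bfloat16,
        )
    model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-xxl", **kwargs)
    if use_4bit:
        model = prepare_model_for_kbit_training(model)
    lora = LoraConfig(
        r=16, lora_alpha=32, lora_dropout=0.05,
        target_modules=["q", "v"],  # T5 attention projections
        task_type="SEQ_2_SEQ_LM",
    )
    return get_peft_model(model, lora)

qlora_model = load_model(use_4bit=True)   # 4-bit base + LoRA (QLoRA)
lora_model = load_model(use_4bit=False)   # full-precision base + LoRA baseline
```

In practice the two configurations would be trained in separate runs; everything except the quantization config stays identical.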

ghost commented 1 year ago

@artidoro

marclove commented 1 year ago

Relieved to see I'm not the only one. 😅 I get practically the same results as you @KJ-Waller when using the Guanaco fine-tuning script for LLaMA 2, and performance on MMLU goes down. Did anybody here figure out what's going on?

[Screenshots, 2023-07-25: evaluation loss and MMLU accuracy curves]
artidoro commented 1 year ago

Hello! 4-bit quantization is not responsible for what you are describing here. You are observing diverging loss and oscillating MMLU for the following reasons.

  1. In NLP, eval loss is not always directly related to downstream performance (such as task accuracy measured by GPT-4 eval).
  2. The MMLU dev set is small, which explains the swings in MMLU accuracy during finetuning. These values are only indicative; you have to compute test-set performance to get a more stable result. We use the last checkpoint for this.
  3. As shown in our paper, finetuning on the OpenAssistant dataset significantly improves chatbot performance, as measured by Vicuna GPT-4 eval. However, it does not help much on MMLU (performance degrades or stays the same compared to no OA finetuning).

Ultimately, we showed that you should be evaluating on your downstream task when finetuning on a dataset. And you should think very carefully about which target benchmark you are optimizing, as it is not always indicative of the desired performance. MMLU dev results were used to tune hyperparameters in our paper, but Vicuna eval was much more relevant for chatbot evaluation.

marclove commented 1 year ago

Thank you @artidoro! I ran the finetune_llama2_guanaco_7b.sh script to try to orient myself to what I should expect with qlora before running on my own dataset and thought I had screwed something up. Very much appreciate the clarification!

I get the reasoning around not depending on MMLU metrics (is there a reason why you have it on by default in the script?), but I thought eval loss should still give some indication of overfitting/underfitting issues. Is that a misconception on my part?

waterluck commented 4 months ago

@KJ-Waller @FHL1998 I get the same loss trend when doing full finetuning on LLaMA 2. Have you solved this problem in the end?

KJ-Waller commented 4 months ago

@KJ-Waller @FHL1998 I get the same loss trend when doing full finetuning on LLaMA 2. Have you solved this problem in the end?

Sorry, I haven't worked on this in a while and didn't pursue it any further. Good luck!