Closed KJ-Waller closed 1 year ago
Same issue here, wonder if it is led by the token_id? I tried "self-instruct" and "alpaca", but still get the same problem; the performance is even worse than with GPT-2.
wonder if it is led by the token_id?
Glad to hear I'm not the only one. But what do you mean exactly by this comment?
wonder if it is led by the token_id?
Glad to hear I'm not the only one. But what do you mean exactly by this comment?
Basically, I tried 4 ways to get rid of it, but none of them worked:
- tried different dataset formats;
- lr tuning;
- changed the `source_max_len` and `target_max_len`;
- a larger model (30b).
Now I suspect the reason for this issue is that:
- The length and format of the input: some of my inputs are long (around 500 tokens) while others are only about 200 tokens, and my dataset contains lots of `\n` and `-` characters, so I wonder if these are the causes.
- Maybe the size of the dataset is not large enough? But it is still weird that my `train_loss` is convergent while my `eval_loss` increases.
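A quick way to check the input-length hypothesis is to look at the token-length spread of the dataset before training, and compare it against the `source_max_len`/`target_max_len` truncation limits. A minimal sketch (using a whitespace split as a stand-in for the real tokenizer, so the counts are only approximate):

```python
def length_stats(texts, tokenize=str.split):
    """Return (min, max, mean) token counts for a list of strings."""
    lengths = [len(tokenize(t)) for t in texts]
    return min(lengths), max(lengths), sum(lengths) / len(lengths)

# Toy samples standing in for a real dataset; a wide min/max gap means
# many examples are heavily padded or truncated at a fixed max length.
samples = [
    "a short example",
    "a much longer example " * 50,
]
lo, hi, mean = length_stats(samples)
print(lo, hi, mean)
```

If the maximum is far above the configured max length, long examples are being silently truncated, which can distort both training and eval loss.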
Interesting. I tried several things too, including some lr tuning, larger models, and different models like falcon-7b and falcon-40b. I also played a bit with the `source_max_len` parameter and used different datasets. Then I ran the default script without changing anything about the model or dataset, and still saw the training loss converge while the eval loss diverged.
Same issue. I think maybe the model is too large and the dataset too small, so the model is overfitting to the training set?
I think these PEFT models overfit very quickly on small-ish datasets. I have the same issue.
I have the same issue with flan-t5-xxl, ul2, and xglm. But when I ran the same code without 4-bit, using just LoRA, the model converged normally. So it seems to be the 4-bit part.
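For intuition on why 4-bit could behave differently from full-precision LoRA: each weight gets rounded to one of only 16 levels, which injects noise into every forward pass. A toy round-trip below makes the rounding error visible (this is plain symmetric uniform quantization, not the NF4 scheme QLoRA actually uses, so it only illustrates the general effect):

```python
def quantize_4bit(values):
    """Toy symmetric 4-bit uniform quantization: map floats onto 15
    levels spanning [-max_abs, max_abs], then dequantize back."""
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        return list(values)
    scale = max_abs / 7  # signed 4-bit range is -8..7; use -7..7 symmetric
    quantized = [max(-8, min(7, round(v / scale))) for v in values]
    return [q * scale for q in quantized]

weights = [0.31, -0.12, 0.77, -0.54]
restored = quantize_4bit(weights)
errors = [abs(w - r) for w, r in zip(weights, restored)]
print(max(errors))
```

The worst-case error per weight is about half the quantization step, which is why QLoRA relies on the LoRA adapters (kept in higher precision) to compensate during fine-tuning.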
And as far as I can see, performance on the training set is decreasing as well.
@artidoro
Relieved to see I'm not the only one. 😅 I get practically the same results as you @KJ-Waller when using the Guanaco fine-tuning script for LLaMA 2, and performance on MMLU goes down. Did anybody here figure out what's going on?
Hello! 4-bit quantization is not responsible for what you are describing here. You are observing diverging loss and oscillating MMLU for the following reasons.
Ultimately, we showed that you should be evaluating on your downstream task when finetuning on a dataset. And you should think very carefully about what target benchmark you are optimizing as this is not always indicative of the desired performance. MMLU dev results were used to tune hyperparameters in our paper but Vicuna eval was much more relevant for chatbot evaluation.
Thank you @artidoro! I ran the `finetune_llama2_guanaco_7b.sh` script to try to orient myself to what I should expect from qlora before running on my own dataset, and thought I had screwed something up. Very much appreciate the clarification!
I get the reasoning around not depending on MMLU metrics (is there a reason why you have it on by default in the script?), but I thought eval loss should still give some indication of overfitting/underfitting issues. Is that a misconception on my part?
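The intuition that a rising eval loss signals overfitting is exactly what early stopping formalizes. A minimal sketch of the rule (a hypothetical helper, not part of the qlora codebase) that would have halted the diverging runs described in this thread:

```python
def should_stop(eval_losses, patience=3):
    """Early-stopping rule: stop once eval loss has failed to improve
    on its best value for `patience` consecutive evaluations."""
    best = float("inf")
    bad = 0
    for loss in eval_losses:
        if loss < best:
            best = loss
            bad = 0
        else:
            bad += 1
            if bad >= patience:
                return True
    return False

# An eval-loss curve that bottoms out and then climbs trips the rule.
print(should_stop([1.0, 0.9, 0.95, 1.1, 1.3]))
```

Whether the checkpoint at the eval-loss minimum is actually the best one for a downstream task is a separate question, which is the point being made about MMLU above.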
@KJ-Waller @FHL1998 i get the same loss trend when full finetune on llama 2, have you solved this problem finally?
@KJ-Waller @FHL1998 i get the same loss trend when full finetune on llama 2, have you solved this problem finally?
Sorry, I haven't worked on this in a while and didn't pursue it any further. Good luck
Is anyone else having this issue when using the `finetune_guanaco_7b.sh` script? I keep seeing the evaluation loss diverge rather than converge. I noticed this originally trying to finetune on my own datasets, and after troubleshooting, noticed that this happens for me as well with the original training scripts provided.
The training loss curve is below: