UKPLab / sentence-transformers

Multilingual Sentence & Image Embeddings with BERT
https://www.SBERT.net
Apache License 2.0

Poor performance when training beyond 10% of the steps #2815

Closed · daegonYu closed this issue 5 days ago

daegonYu commented 1 week ago

Hello. I discovered something strange during pre-training with a contrastive loss: as training progresses, the model's performance decreases. A common finding across multiple experiments is that performance starts to drop after roughly 5 to 10% of the training steps. The average cosine similarity on the benchmark dataset also drops: both the average cosine similarity of anchor-positive pairs and the average cosine similarity of anchor-negative pairs are lowered. This seems inconsistent with the expectation that, with scale (temperature) = 0.01, the cosine similarities should be distributed roughly between 0.7 and 1.0 (https://huggingface.co/intfloat/multilingual-e5-large). Details of the experiment are attached below.

Base model (Korean): klue/roberta-large

Loss: CachedMultipleNegativesRankingLoss, batch size: 8192, learning rate: 5e-5

Dataset:

  1. Korean Wiki -> {'title', 'content'} (ratio: 4%)
  2. Korean News -> {'title', 'content'} (ratio: 93%)
  3. etc. -> {'title', 'content'} (ratio: 3%)
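For reference, a minimal sketch of a training setup like the one described above, using the Sentence Transformers v3 Trainer API. The data file path, column layout, mini_batch_size, and the mapping of temperature 0.01 to scale=100 are assumptions for illustration, not details from the original run:

```python
from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import CachedMultipleNegativesRankingLoss

# Plain Korean RoBERTa; Sentence Transformers wraps it with mean pooling automatically.
model = SentenceTransformer("klue/roberta-large")

# Assumed (anchor, positive) columns corresponding to the title/content pairs.
train_dataset = load_dataset("json", data_files="train.jsonl", split="train")

# CachedMNRL keeps the 8192 effective batch size feasible by chunking it into mini-batches.
# In this loss, scale is the inverse temperature, so temperature 0.01 corresponds to scale=100.
loss = CachedMultipleNegativesRankingLoss(model, scale=100.0, mini_batch_size=64)

args = SentenceTransformerTrainingArguments(
    output_dir="output/klue-roberta-large-contrastive",
    per_device_train_batch_size=8192,
    learning_rate=5e-5,
    num_train_epochs=1,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()
```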

Below are the results of benchmark evaluation every 50 steps. The best performance comes from the model after only the first 50 steps.

Result in miracl(ko) ndcg@10

image

Result in mldr(ko) ndcg@10

image

This is a histogram of the cosine similarities of positive and negative pairs at 50 steps on the miracl benchmark dataset. (0: negatives, 1: positives)

image

This is a histogram of the cosine similarities of positive and negative pairs at 550 steps on the miracl benchmark dataset. (0: negatives, 1: positives)

image

We confirmed that as training progresses, the cosine similarities of both positive and negative pairs gradually decrease.
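For readers who want to reproduce this kind of check, a rough sketch of how a pos/neg cosine-similarity histogram can be produced with Sentence Transformers and matplotlib; the checkpoint path and the pair list are placeholders rather than the actual miracl data:

```python
import matplotlib.pyplot as plt
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("output/checkpoint-50")  # placeholder checkpoint path

# Placeholder evaluation pairs: (anchor, candidate, label) with label 1 = positive, 0 = negative.
pairs = [
    ("example query", "relevant passage", 1),
    ("example query", "irrelevant passage", 0),
]

anchors = model.encode([a for a, _, _ in pairs], normalize_embeddings=True)
candidates = model.encode([c for _, c, _ in pairs], normalize_embeddings=True)

# Cosine similarity of each anchor with its own candidate (embeddings are already normalized).
sims = (anchors * candidates).sum(axis=1)

pos = [s for s, (_, _, label) in zip(sims, pairs) if label == 1]
neg = [s for s, (_, _, label) in zip(sims, pairs) if label == 0]

plt.hist(neg, bins=50, alpha=0.5, label="0: negatives")
plt.hist(pos, bins=50, alpha=0.5, label="1: positives")
plt.xlabel("cosine similarity")
plt.legend()
plt.show()
```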

I get similar results even when using different base models and training data. Can you tell me why?

We desperately need help.

Thank you.

tomaarsen commented 1 week ago

Hello!

I think you should be on the right track here: your base model, dataset, and chosen loss all look solid conceptually. So, the cause is likely something a bit more niche. Some potential ideas:

  1. The dataset quality: Sometimes the quality of the dataset is not quite as good as what the model was trained on before. However, this is usually only a problem if you are continuing to finetune an existing Sentence Transformer model that was trained with high-quality data. That is not the case here: you're training from a Korean roberta-large that wasn't trained for embeddings. In other words, I would not expect dataset quality to be a problem here.
  2. The learning rate: When you change the batch size, the learning rate may also need changing. A general guide is multiplying the old learning rate by sqrt(new_batch_size / old_batch_size), so if 5e-5 is a good option for a batch size of 64, then 5e-5 * sqrt(8192 / 64) ~= 5e-5 * 11.3 ~= 5.6e-4 might be an interesting option to try. That said, usually when my performance decreases with more training it is because the learning rate is actually too high: you can also try 5e-6 and see how that does. (See the short sketch after this list.)
  3. The loss: Although the evaluation performance on the independent benchmarks is decreasing, I'm quite interested to see whether the model is actually making progress on its goal of decreasing the loss, both for training and for evaluation (you can easily make an eval_dataset by splitting off 1k samples from the train_dataset for a quick test). Some cases:
    • train & eval loss are falling: then the model is making progress towards its goal of optimizing for your dataset. This might indicate that your dataset doesn't align well with the MIRACL/MLDR datasets. (NOTE: This is very possible, because MIRACL and MLDR are Information Retrieval datasets, whereas you're training with title-content pairs. If train & eval loss are falling, then perhaps you could try to generate queries based on your content and then train with query-content pairs instead. That aligns better with IR.)
    • train loss is falling, eval loss is increasing: You are overfitting, causing the worse "out of distribution" performance on MIRACL/MLDR.
    • train & eval loss are increasing: The model is unable to learn, perhaps because the learning rate is too large or there is an issue in the data.
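As a concrete illustration of points 2 and 3 above (the square-root rule for scaling the learning rate with the batch size, and splitting off a small eval_dataset to track the evaluation loss), a minimal sketch; the file path and numbers are placeholders:

```python
from datasets import load_dataset

# Point 2: scale the learning rate with the square root of the batch size ratio.
old_lr, old_batch_size, new_batch_size = 5e-5, 64, 8192
new_lr = old_lr * (new_batch_size / old_batch_size) ** 0.5  # 5e-5 * 11.3 ≈ 5.66e-4

# Point 3: hold out ~1k samples from the training data to monitor the evaluation loss.
dataset = load_dataset("json", data_files="train.jsonl", split="train")  # placeholder path
splits = dataset.train_test_split(test_size=1_000, seed=42)
train_dataset, eval_dataset = splits["train"], splits["test"]
# Pass eval_dataset to the trainer (with eval_strategy="steps" in the training arguments)
# so that both the train loss and the eval loss are logged during training.
```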

I hope this helps a bit.

Oh, another thing to consider is that by default the training arguments include a learning rate warmup of 10%; this might correspond a bit with your best performance being around the 10% mark.
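A rough sketch of how the warmup ratio and evaluation frequency can be set explicitly in the training arguments, assuming the v3 SentenceTransformerTrainingArguments API; the values shown are placeholders:

```python
from sentence_transformers import SentenceTransformerTrainingArguments

args = SentenceTransformerTrainingArguments(
    output_dir="output/klue-roberta-large-contrastive",
    per_device_train_batch_size=8192,
    learning_rate=5e-5,
    warmup_ratio=0.0,       # set to 0.1 for a 10% linear learning rate warmup
    eval_strategy="steps",  # called evaluation_strategy in older transformers versions
    eval_steps=50,          # matches the 50-step benchmark evaluation interval above
    logging_steps=50,
)
```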

daegonYu commented 1 week ago

Thank you, I understand what you said. It helped me a lot.

image

image

image

Train loss

image

These are graphs of the validation loss and of ndcg@10 on the miracl and mldr benchmarks tested earlier (the dataset and model are different from the ones mentioned in the question, but can be considered almost the same).
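For context, a sketch of how ndcg@10 on a retrieval benchmark such as miracl or mldr can be measured with the library's InformationRetrievalEvaluator; the checkpoint path, queries, corpus, and relevance mapping are placeholders rather than the actual benchmark data:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator

model = SentenceTransformer("output/checkpoint-550")  # placeholder checkpoint path

# Placeholder benchmark data: query id -> text, doc id -> text, query id -> relevant doc ids.
queries = {"q1": "example query"}
corpus = {"d1": "relevant document", "d2": "irrelevant document"}
relevant_docs = {"q1": {"d1"}}

evaluator = InformationRetrievalEvaluator(
    queries=queries,
    corpus=corpus,
    relevant_docs=relevant_docs,
    ndcg_at_k=[10],
    name="miracl-ko",
)
metrics = evaluator(model)  # returns the evaluation metrics, including ndcg@10
```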

You can see that both the train loss and the validation loss are trending downward. So, should I suspect overfitting? Every time I experiment, this happens after the initial 5~10% of training (including the 10% warmup). Does overfitting easily occur when training Sentence BERT models? Or, since we are currently training on title/document pairs rather than question/document pairs, is it possible that performance decreases because the dataset differs slightly from the IR task?

Additionally, I will train on question/document pairs and see whether this phenomenon occurs again.

tomaarsen commented 1 week ago

Thanks for sharing your loss figures. They look very good to me, with no signs of overfitting whatsoever: if there were, the eval_... figures would start going up again. Instead, they're nicely following a pretty standard "hockey stick" pattern of large improvements in the loss as the model starts to understand the task, followed by slow improvements as it starts to learn the details.

Does overfitting easily occur when training Sentence BERT models?

No. I think I've only seen overfitting in Sentence Transformers when training with more than 1 epoch.

Or, since we are currently training on title/document pairs rather than question/document pairs, is it possible that performance decreases because the dataset differs slightly from the IR task?

I think this is what's happening indeed. I think your model is getting really good at "given the title, find the document", but at the cost of "given a query, find the document". I think you'll indeed have more luck with a question-document format; perhaps you can generate queries given your documents.
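One way to generate queries from documents is with a doc2query-style text-to-text model via the transformers pipeline. The sketch below uses doc2query/msmarco-t5-base-v1, an English model, purely as a placeholder; a Korean or multilingual query-generation model would be needed for this use case:

```python
from transformers import pipeline

# Placeholder English doc2query model; swap in a Korean/multilingual query generator in practice.
generator = pipeline("text2text-generation", model="doc2query/msmarco-t5-base-v1")

document = "Example document text to generate training queries for."
outputs = generator(document, max_length=64, num_return_sequences=3, do_sample=True, top_p=0.95)

# Pair each generated query with its source document to form (query, content) training pairs.
pairs = [(out["generated_text"], document) for out in outputs]
```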

daegonYu commented 1 week ago

Thank you for your reply. It was very helpful. We will continue to experiment and produce good results. Thank you.