Closed: daegonYu closed this issue 5 days ago
Hello!
I think you're on the right track here: your base model, dataset, and chosen loss all look solid conceptually, so the cause is likely a bit more niche. Some potential ideas:
A common heuristic when increasing the batch size is to scale the learning rate by sqrt(new_batch_size / old_batch_size). So if 5e-5 is a good option for a batch size of 64, then with a batch size of 8192 the factor is sqrt(8192 / 64) ≈ 11.3, and 5e-5 * 11.3 ≈ 5.6e-4 might be an interesting option to try. That said, usually when my performance decreases with more training it is because the learning rate is actually too high: you can also try 5e-6 and see how that does. I hope this helps a bit.
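As a quick sanity check, the square-root scaling heuristic above works out as follows (batch sizes taken from this thread):

```python
import math

def scale_lr(old_lr: float, old_batch: int, new_batch: int) -> float:
    """Square-root learning-rate scaling heuristic for larger batches."""
    return old_lr * math.sqrt(new_batch / old_batch)

# 64 -> 8192 gives a factor of sqrt(128) ~= 11.3
print(scale_lr(5e-5, 64, 8192))  # roughly 5.66e-4
```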
Oh, another thing to consider is that by default the training arguments include learning rate warmup over the first 10% of the steps; this might correspond with your best performance appearing at around 10%.
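To illustrate that overlap: with linear warmup, the learning rate peaks exactly when warmup ends, which for a 10% ratio falls at the same point where the performance drop is reported (the step count below is illustrative, not from the thread):

```python
def warmup_end_step(total_steps: int, warmup_ratio: float = 0.1) -> int:
    # With linear warmup, the learning rate ramps up and peaks at this
    # step, then decays; a performance dip near the peak learning rate
    # can therefore line up with the end of warmup.
    return int(total_steps * warmup_ratio)

print(warmup_end_step(550))  # 55, i.e. warmup ends at 10% of training
```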
Thank you, I understand what you said. It helped me a lot.
Train loss
This is a graph of the valid loss and ndcg@10 on the miracl and mldr benchmarks tested previously (the dataset and model differ from those mentioned in the question, but are almost identical).
You can see that both train loss and valid loss are trending downward, so could this really be overfitting? Every time I experiment, this happens after the initial 5~10% of training (including the 10% warmup). Does overfitting easily occur in Sentence BERT learning? Or, since we are currently in the title/document learning phase rather than the question/document learning phase, is it possible that performance may decrease as we use a slightly different dataset from the IR Task?
Additionally, I will train with question/document formats and see whether this phenomenon occurs again.
Thanks for sharing your loss figures. They look very good to me, with no signs of overfitting whatsoever: if there was overfitting, the eval_... figures would start going up again, but instead they're nicely following a pretty standard "hockeystick pattern" of large improvements in the loss as the model starts to understand the task, then slow improvements in the loss as it starts to learn the details.
Does overfitting easily occur in Sentence BERT learning?
No. I think I've only seen overfitting in Sentence Transformers when training with more than 1 epoch.
Or, since we are currently in the title/document learning phase rather than the question/document learning phase, is it possible that performance may decrease as we use a slightly different dataset from the IR Task?
I think this is what's happening indeed. I think your model is getting really good at "given the title, find the document", but at the cost of "given a query, find the document". I think you'll have more luck with the question/document format indeed; perhaps you can generate queries given your documents.
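A minimal sketch of that idea, using a hypothetical `generate_query` stub (a real setup might plug in a query-generation model here; nothing in this sketch is from the thread itself):

```python
# Sketch: turn title/document data into (query, document) training pairs.
def generate_query(document: str) -> str:
    # Hypothetical stub: in practice, replace this with a generative
    # model that writes a plausible search query for the document.
    return "what does this document say: " + document[:30]

def build_query_document_pairs(documents):
    # Each pair can then serve as (anchor, positive) for the
    # in-batch-negatives contrastive loss discussed in this thread.
    return [(generate_query(doc), doc) for doc in documents]

pairs = build_query_document_pairs(["Seoul is the capital of South Korea."])
```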
Thank you for your reply. It was very helpful. We will continue to experiment and produce good results. Thank you.
Hello. I discovered something strange during pre-training with a contrastive loss: as training progresses, the model's performance decreases. A common finding across multiple experiments is that performance decreases from about 5~10% of the steps onward. It also reduces the average cosine similarity scores on the benchmark dataset: both the average cosine similarity of the anchor/positive pairs and that of the anchor/negative pairs are lowered. This looks inconsistent with the expectation that, when setting scale (temperature) = 0.01, the cosine similarities should end up distributed within roughly 0.7 to 1.0 (https://huggingface.co/intfloat/multilingual-e5-large). Details of the experiment are attached below.
Base model (Korean): klue/roberta-large
Loss: CachedMultipleNegativesRankingLoss
Batch size: 8192
Learning rate: 5e-5
Dataset:
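For reference, the setup described above can be sketched with sentence-transformers roughly as follows; this is a sketch, not the exact training script from this thread, and the `mini_batch_size` value is an assumption:

```python
from sentence_transformers import SentenceTransformer, losses

model = SentenceTransformer("klue/roberta-large")

# CachedMultipleNegativesRankingLoss allows very large effective batch
# sizes (here 8192 per training batch) by chunking each batch into
# mini-batches (GradCache); mini_batch_size=128 is an assumption,
# not a value stated in the thread.
loss = losses.CachedMultipleNegativesRankingLoss(model, mini_batch_size=128)
```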
Below are the results of benchmark evaluation every 50 steps. The best performance comes from the model after the first 50 steps of training.
Result in miracl(ko) ndcg@10
Result in mldr(ko) ndcg@10
This is a histogram of the cosine similarity of pos and neg at 50 steps for the miracl benchmark dataset. (0: neg, 1:pos)
This is a histogram of the cosine similarity of pos and neg at 550 steps for the miracl benchmark dataset. (0: neg, 1:pos)
We confirmed that as training progresses, the cosine similarities of both the pos and neg pairs gradually decrease.
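The averages described above reduce to plain cosine similarity over the pair lists; a minimal sketch (pure Python, embeddings assumed to be plain float lists):

```python
import math

def cosine(u, v):
    # Standard cosine similarity between two vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_pair_similarities(anchors, positives, negatives):
    # Average anchor/positive and anchor/negative cosine similarities;
    # in this thread, both averages were observed to drop over training.
    pos = [cosine(a, p) for a, p in zip(anchors, positives)]
    neg = [cosine(a, n) for a, n in zip(anchors, negatives)]
    return sum(pos) / len(pos), sum(neg) / len(neg)
```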
I get similar results even when using different base models and training data. Can you tell me why?
We desperately need help.
Thank you.