UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

Training SBERT for Information Retrieval using MultipleNegativesRankingLoss with different set of batch sizes #1189

Open Nicolabo opened 2 years ago

Nicolabo commented 2 years ago

I was trying to mimic the Quora Duplicate Questions example for my use case. However, one important point you make in the documentation is:

Note 2: MultipleNegativesRankingLoss only works if (a_i, b_j) with j != i is actually a negative, non-duplicate question pair. In a few instances, this assumption is wrong. But in the majority of cases, if we sample two random questions, they are not duplicates. If your dataset cannot fulfill this property, MultipleNegativesRankingLoss might not work well.

The thing is, in my case I have more duplicates than in the Quora dataset. However, I was thinking that if I control how the batches are created (making sure that only one pair of duplicates ever appears within a batch), it might still work. Actually, I've already written a process that splits the data into batches this way, but I noticed it will not always be possible to get equally sized batches while meeting that criterion.
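Roughly, the splitting step I mean looks something like this (a simplified sketch, not my exact code; it only guards against the same question text appearing twice in a batch):

```python
from sentence_transformers import InputExample
from torch.utils.data import DataLoader

# Toy pairs for illustration; in practice these are my duplicate-question pairs
train_examples = [
    InputExample(texts=["How do I learn Python?", "What is the best way to learn Python?"]),
    InputExample(texts=["How far is the Moon?", "What is the distance to the Moon?"]),
]

def build_batches(examples, batch_size=64):
    """Greedily group pairs so that no question text repeats within a batch;
    returns lists of dataset indices (simplified sketch)."""
    batches, current, seen = [], [], set()
    for idx, ex in enumerate(examples):
        a, b = ex.texts
        if current and (len(current) == batch_size or a in seen or b in seen):
            batches.append(current)
            current, seen = [], set()
        current.append(idx)
        seen.update([a, b])
    if current:
        batches.append(current)  # the last batch may hold fewer than batch_size pairs
    return batches

batch_indices = build_batches(train_examples)

# Pre-built batches of varying size can be passed to the DataLoader via batch_sampler;
# sentence-transformers' fit() then handles the collation on top of it.
train_dataloader = DataLoader(train_examples, batch_sampler=batch_indices)
```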

For example, if I want to create batches of size 64, it may turn out that the last ones contain fewer than 64 elements. Do you think it would be a problem to train the model with varying batch sizes (e.g. most batches with 64 elements, but some with 40, 34, etc.), or is it better to use a single but smaller batch size (e.g. batch_size = 32, which in this case would work for all the data)?

I am asking because of your Note 1:

Note 1: Increasing the batch sizes usually yields better results, as the task gets harder.

Thanks,

nreimers commented 2 years ago

This is usually not a problem.

You could set the parameter drop_last=True for the DataLoader; then it will drop the last mini-batch if its size is smaller than 64.
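For example, something along these lines (just a sketch; the model name and parameters are only placeholders):

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("distilbert-base-uncased")  # any base model works here

train_examples = [
    InputExample(texts=["How do I learn Python?", "What is the best way to learn Python?"]),
    InputExample(texts=["How far is the Moon?", "What is the distance to the Moon?"]),
    # ... the rest of your duplicate-question pairs
]

# drop_last=True discards the final, smaller mini-batch, so every batch has exactly 64 pairs
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=64, drop_last=True)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)
```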

Nicolabo commented 2 years ago

Right. But I wasn't thinking about dropping those smaller batches; I actually want to include them. For example, in my dataset about 70% of the batches have batch_size = 64, while the remaining 30% range between 34 and 64 elements. I don't want to remove them, but maybe I should. That's my question.

nreimers commented 2 years ago

You can keep them.

Nicolabo commented 2 years ago

Cool thanks :)