UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

Strategy for building sentence pairs #978

Open PaulForInvent opened 3 years ago

PaulForInvent commented 3 years ago

Let's say you have 10 example sentences per class and want to use a loss that expects sentence pairs, such as MultipleNegativesRankingLoss, which expects positive pairs. So, is it actually advisable not to use all pairwise combinations of the 10 examples? From the 10 sentences of one class you can build 10·9/2 = 45 positive pairs, each usable as an anchor and its positive!
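For concreteness, a minimal sketch of that enumeration (the sentence list is a toy stand-in):

```python
from itertools import combinations

# Toy stand-in for the 10 sentences of one class.
sentences = [f"sentence {i}" for i in range(10)]

# All unordered pairs of distinct sentences: 10 * 9 / 2 = 45 positive pairs.
pairs = list(combinations(sentences, 2))
print(len(pairs))  # 45
```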

I wonder if this is really a good idea, because you only have 10 sentences, but the model trains on a combinatorially inflated number of samples. I think this leads to much more overfitting (you only have the information contained in these 10 sentences, but train on hundreds of pairs built from them).

I wonder if there is any "right" strategy for choosing the sentence pairs, such as using only a fraction of them, or using only pairs in which each sentence occurs exactly once.
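For illustration, the second strategy (each sentence used exactly once) could look like the sketch below; the helper name is made up, and reshuffling each epoch would yield different pairs over time:

```python
import random

def disjoint_positive_pairs(sentences):
    """Pair up the sentences so each one occurs in exactly one pair:
    10 sentences yield 5 positive pairs instead of 45."""
    shuffled = random.sample(sentences, len(sentences))
    return list(zip(shuffled[::2], shuffled[1::2]))

pairs = disjoint_positive_pairs([f"sentence {i}" for i in range(10)])
print(len(pairs))  # 5
```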

nreimers commented 3 years ago

I would create a custom PyTorch dataset or dataloader that returns batches with the desired properties. No need to add all 45 pairs per class to the standard datasets / dataloaders.
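A minimal sketch of such a dataset, assuming `sentences_by_class` is a dict mapping each class label to its list of sentences (all names here are illustrative):

```python
import random
from torch.utils.data import Dataset
from sentence_transformers import InputExample

class RandomPositivePairDataset(Dataset):
    """Each item is a freshly sampled (anchor, positive) pair from one class,
    so training never materializes all 45 pairs per class."""

    def __init__(self, sentences_by_class, pairs_per_class=10):
        self.sentences_by_class = sentences_by_class
        self.labels = list(sentences_by_class)
        self.pairs_per_class = pairs_per_class

    def __len__(self):
        # Nominal epoch length: a fixed number of pairs per class.
        return len(self.labels) * self.pairs_per_class

    def __getitem__(self, idx):
        label = self.labels[idx % len(self.labels)]
        # Two distinct sentences from the same class form a positive pair.
        anchor, positive = random.sample(self.sentences_by_class[label], 2)
        return InputExample(texts=[anchor, positive])
```

Since MultipleNegativesRankingLoss treats the other pairs in a batch as in-batch negatives, ideally a batch should not contain two pairs from the same class, otherwise those act as false negatives.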

Yes, it could lead to overfitting. So maybe just sample some of the 45 possible pairs per class for actual training.
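Wiring this into training could look like the following sketch; it builds on the dataset class above, and the model name and hyperparameters are placeholders:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, losses

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder base model

dataset = RandomPositivePairDataset(sentences_by_class, pairs_per_class=10)
loader = DataLoader(dataset, batch_size=16, shuffle=True)
loss = losses.MultipleNegativesRankingLoss(model)

# Because __getitem__ resamples, each epoch sees a different subset of pairs.
model.fit(train_objectives=[(loader, loss)], epochs=3)
```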