Open · PaulForInvent opened 3 years ago
I would create a custom PyTorch Dataset or DataLoader that returns batches with the desired properties. There is no need to add all 45 pairs to the standard datasets / dataloaders.
Yes, it could lead to overfitting. So maybe just sample some of the 45 possible pairs for actual training.
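A minimal sketch of such a sampling dataset, assuming the data comes as a dict mapping class label to sentences; the names `PositivePairDataset`, `sentences_per_class`, and `samples_per_epoch` are hypothetical, not part of the library. Each item is a freshly sampled (anchor, positive) pair, so the full pair list is never materialized:

```python
import random

from torch.utils.data import Dataset
from sentence_transformers import InputExample


class PositivePairDataset(Dataset):
    """Yields one freshly sampled (anchor, positive) pair per item,
    instead of enumerating all pairs of each class up front."""

    def __init__(self, sentences_per_class, samples_per_epoch=1000):
        # sentences_per_class: dict mapping class label -> list of sentences.
        # Classes with fewer than 2 sentences cannot form a positive pair.
        self.groups = [s for s in sentences_per_class.values() if len(s) >= 2]
        self.samples_per_epoch = samples_per_epoch

    def __len__(self):
        # Controls how many sampled pairs count as "one epoch".
        return self.samples_per_epoch

    def __getitem__(self, idx):
        sentences = random.choice(self.groups)
        anchor, positive = random.sample(sentences, 2)  # two distinct sentences
        return InputExample(texts=[anchor, positive])
```

Wrapped in an ordinary `DataLoader` and trained with `MultipleNegativesRankingLoss`, this keeps the per-epoch sample count decoupled from the pair blow-up. One caveat with that loss: two pairs from the same class in one batch act as false in-batch negatives, which a custom batch sampler could avoid.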
Let's say you have 10 example sentences per class and want to use a loss that expects positive sentence pairs, like MultipleNegativesRankingLoss. Is it actually advisable to use all possible pairs of the 10 examples? From the 10 sentences of one class you can build 10·9/2 = 45 positive pairs (90 ordered (anchor, positive) pairs) to use as anchor and positive.
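For concreteness, the pair counts can be checked with `itertools` (a small illustrative snippet, not from the thread):

```python
from itertools import combinations, permutations

sentences = [f"sentence {i}" for i in range(10)]  # the 10 examples of one class

unordered = list(combinations(sentences, 2))  # {a, b}: 10 * 9 / 2 = 45 pairs
ordered = list(permutations(sentences, 2))    # (anchor, positive): 10 * 9 = 90 pairs

print(len(unordered), len(ordered))  # 45 90
```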
I wonder if this is really a good idea, because you only have 10 sentences, but the model trains on a quadratically larger number of samples built from them. I think this invites much more overfitting, since all the information is contained in those 10 sentences, yet you train on dozens of pairs made from them.
I wonder if there is any "right" strategy for choosing the sentence pairs, like using just a fraction of them, or only those pairings in which each sentence occurs a single time.
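One way to sketch the "each sentence occurs a single time" strategy is to shuffle a class's sentences once per epoch and pair them off disjointly; `disjoint_positive_pairs` below is a hypothetical helper, not a library function:

```python
import random


def disjoint_positive_pairs(sentences):
    """Pair sentences so that each one occurs in at most one pair:
    10 sentences -> 5 pairs instead of 45."""
    shuffled = sentences[:]
    random.shuffle(shuffled)
    # With an odd count, the last sentence is simply left out this round.
    return [(shuffled[i], shuffled[i + 1]) for i in range(0, len(shuffled) - 1, 2)]


pairs = disjoint_positive_pairs([f"sentence {i}" for i in range(10)])
print(len(pairs))  # 5
```

Re-shuffling each epoch exposes different pairings over time while keeping the number of training pairs proportional to the amount of actual data.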