UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

Question: MultipleNegativesRankingLoss and dataset quora-duplicates-mining: preprocessing needed? #2870

Open koen-dejonghe opened 1 month ago

koen-dejonghe commented 1 month ago

MultipleNegativesRankingLoss says: The loss will use for the pair (a_i, p_i) all p_j for j != i and all n_j as negatives.

I think this implies that all anchors must be (1) unique and (2) not positives of each other.

However, looking at the dataset sentence-transformers/quora-duplicates-mining used in the example training_MultipleNegativesRankingLoss.py, this is not the case: the same anchors occur multiple times, and so do some of the positives. Shouldn't this dataset be preprocessed a bit before using it with MultipleNegativesRankingLoss?
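To illustrate the concern with made-up pairs (not the actual dataset): MultipleNegativesRankingLoss treats every other in-batch positive p_j (j != i) as a negative for anchor a_i, so a repeated anchor turns its own positive into a false negative.

```python
# Toy batch of (anchor, positive) pairs; "false negative" here means a pair
# that MNRL would score as a negative even though it is semantically positive.
batch = [
    ("How do I learn Python?", "What is the best way to learn Python?"),
    ("How do I learn Python?", "Where can I study Python online?"),  # duplicate anchor
    ("What is gravity?", "How does gravity work?"),
]

false_negatives = []
for i, (a_i, p_i) in enumerate(batch):
    for j, (a_j, p_j) in enumerate(batch):
        if j != i and a_j == a_i:
            # p_j is a true positive for a_i, but the in-batch negatives
            # of MNRL would treat it as a negative.
            false_negatives.append((a_i, p_j))

print(len(false_negatives))
```

Both positives of the duplicated anchor end up as false negatives for each other's row.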

Thank you.

rispoli997 commented 1 month ago

MNR loss is often used with a batch sampler that excludes duplicates (as in the script you linked). During training, a given string is loaded only once per batch; subsequent appearances of the same anchor/positive/negative are skipped.

koen-dejonghe commented 1 month ago

"subsequent appearances of the same anchor/positive/negative will be skipped" That is correct, but it's not what I'm saying.

The anchors by themselves should only occur once. Same for the positives.

rispoli997 commented 1 month ago

Maybe I misunderstood your question? But to me it still seems like the answer is the same. The duplicates are detected at the "string level", not at the "triplet level", meaning that you can't have the same string twice in the same batch (it doesn't matter whether the string is an anchor, a positive, or a negative). So even if your dataset has multiple triplets in which the same anchor appears, only the first instance will be loaded (same for positives and negatives).
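A minimal, hypothetical sketch of string-level deduplication as described above (this is not the actual sentence-transformers implementation; the function name and data are made up): a sample is skipped whenever any of its texts already occurs in the current batch.

```python
def no_duplicate_batches(samples, batch_size):
    """Yield batches in which no string (anchor, positive, or negative) repeats."""
    batch, seen = [], set()
    for sample in samples:
        if seen & set(sample):
            # A real sampler would retry this sample in a later batch;
            # this minimal sketch simply skips it.
            continue
        batch.append(sample)
        seen |= set(sample)
        if len(batch) == batch_size:
            yield batch
            batch, seen = [], set()
    if batch:
        yield batch

pairs = [("a1", "p1"), ("a1", "p2"), ("a2", "p3"), ("a3", "p4")]
for batch in no_duplicate_batches(pairs, batch_size=2):
    print(batch)
```

With this scheme the second ("a1", "p2") row is skipped because "a1" is already in the batch, so every batch contains each string at most once.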

koen-dejonghe commented 1 month ago

I assume you're referring to batch_sampler=BatchSamplers.NO_DUPLICATES. As far as I understand it, this means no duplicate samples in the same batch. But a sample here consists of a whole triplet, no?

rispoli997 commented 1 month ago

I apologize! My explanation is outdated. I was using ST before v3.0, where you had to provide the DataLoader to the .fit() function yourself, and NoDuplicatesDataLoader worked the way I described. With the new batch sampler, the duplicates seem to be detected at the triplet level, so the issue you describe is indeed present in the newer versions.
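A toy contrast between the two deduplication levels (hypothetical data; the variable names do not match the actual sentence-transformers internals):

```python
pairs = [
    ("q1", "p1"),
    ("q1", "p2"),  # same anchor as the first row, different positive
    ("q2", "p3"),
]

# Triplet-level dedup: only exact duplicate rows are dropped, so the
# repeated anchor "q1" can still end up twice in the same batch.
triplet_level = list(dict.fromkeys(pairs))

# String-level dedup: a row is skipped if any of its strings was seen before.
seen, string_level = set(), []
for row in pairs:
    if seen & set(row):
        continue
    string_level.append(row)
    seen |= set(row)

print(len(triplet_level))
print(len(string_level))
```

Under triplet-level deduplication all three rows survive, so the duplicate anchor reintroduces the false-negative problem from the original question; under string-level deduplication the second "q1" row is dropped.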