koen-dejonghe opened 1 month ago
MNR loss is often used with a batch loader that excludes duplicates (like in the script you linked). During training, a string will be loaded just once; subsequent appearances of the same anchor/positive/negative will be skipped.
"subsequent appearances of the same anchor/positive/negative will be skipped" That is correct, but it's not what I'm saying.
The anchors by themselves should only occur once. Same for the positives.
Maybe I misunderstood your question? But to me the answer still seems the same. The duplicates are detected at the string level, not at the triplet level, meaning you can't have two identical strings in the same batch (regardless of whether the string is an anchor, a positive, or a negative). So even if your dataset has multiple triplets in which the same anchor appears, only the first instance will be loaded (same for positives and negatives).
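The string-level behavior described above can be sketched roughly as follows (a hypothetical simplification for illustration, not the actual library code):

```python
# Sketch of string-level duplicate filtering while filling a batch:
# a triplet is skipped if ANY of its strings already appears in the batch.
def build_batch(triplets, batch_size):
    batch, seen = [], set()
    for triplet in triplets:
        if any(text in seen for text in triplet):
            continue  # a string from this triplet is already in the batch: skip it
        batch.append(triplet)
        seen.update(triplet)
        if len(batch) == batch_size:
            break
    return batch

triplets = [
    ("what is ai", "ai is a field of cs", "the sky is blue"),
    ("what is ai", "ai means machine smarts", "grass is green"),  # duplicate anchor: skipped
    ("what is ml", "ml is a subfield of ai", "water is wet"),
]
batch = build_batch(triplets, batch_size=4)  # keeps triplets 0 and 2 only
```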
I assume you're referring to
batch_sampler=BatchSamplers.NO_DUPLICATES
As far as I understand it, this means no duplicate samples in the same batch.
But a sample here consists of a triplet, no?
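For context, in sentence-transformers v3 this sampler is selected through the training arguments. A minimal configuration sketch (model name, dataset, and batch size are placeholders; see the rest of the thread for how duplicates are actually detected):

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.training_args import (
    SentenceTransformerTrainingArguments,
    BatchSamplers,
)

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model

args = SentenceTransformerTrainingArguments(
    output_dir="output",                 # placeholder path
    per_device_train_batch_size=32,      # placeholder batch size
    # Avoids duplicate samples within a batch; the question in this thread
    # is at what granularity (string vs. triplet) duplicates are detected.
    batch_sampler=BatchSamplers.NO_DUPLICATES,
)
```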
I apologize! My explanation was outdated. I was using ST pre v3.0, where you had to provide the dataloader inside the .fit() function, and the NoDuplicatesDataLoader would work like I stated. With the new data sampler it seems that the duplicates are detected at the triplet level, so your issue is indeed present in the new versions.
The MultipleNegativesRankingLoss documentation says: "The loss will use for the pair (a_i, p_i) all p_j for j != i and all n_j as negatives."
I think this implies that all anchors must be 1) unique and 2) not positives of each other.
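To make the concern concrete, here is a small sketch of the in-batch negative assignment quoted above, operating on strings rather than embeddings for illustration. With a duplicated anchor, a string that is actually a valid positive for that anchor gets treated as a negative (a "false negative"):

```python
# For each (a_i, p_i), all p_j with j != i are used as negatives.
# A p_j is a false negative if it belongs to the same anchor (duplicate
# anchor) or is the same string as p_i (duplicate positive).
def in_batch_negatives(anchors, positives):
    report = {}
    for i, (a, p) in enumerate(zip(anchors, positives)):
        negatives = [positives[j] for j in range(len(positives)) if j != i]
        false_negatives = [
            positives[j]
            for j in range(len(positives))
            if j != i and (anchors[j] == a or positives[j] == p)
        ]
        report[i] = {"negatives": negatives, "false_negatives": false_negatives}
    return report

anchors = ["how do i reset my password", "how do i reset my password"]
positives = ["resetting your password", "password reset steps"]
report = in_batch_negatives(anchors, positives)
# Row 0 uses "password reset steps" as a negative even though it is a
# valid positive for the same (duplicated) anchor.
```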
However, looking at the dataset
sentence-transformers/quora-duplicates-mining
used in the example training_MultipleNegativesRankingLoss.py, this is not the case. The same anchors occur multiple times, and so do some of the positives. Shouldn't this dataset be preprocessed a bit before using it with MultipleNegativesRankingLoss? Thank you.
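One possible preprocessing pass, sketched below, keeps only the first triplet for each anchor and each positive so in-batch negatives are never accidental positives. This is only an illustration of the idea (column names are assumed; applying it to the actual dataset would use its real columns):

```python
# Keep the first row per anchor and per positive; drop later duplicates.
def deduplicate(rows):
    seen_anchors, seen_positives, kept = set(), set(), []
    for row in rows:
        if row["anchor"] in seen_anchors or row["positive"] in seen_positives:
            continue  # this anchor or positive was already kept
        kept.append(row)
        seen_anchors.add(row["anchor"])
        seen_positives.add(row["positive"])
    return kept

rows = [
    {"anchor": "q1", "positive": "p1", "negative": "n1"},
    {"anchor": "q1", "positive": "p2", "negative": "n2"},  # duplicate anchor
    {"anchor": "q2", "positive": "p1", "negative": "n3"},  # duplicate positive
    {"anchor": "q2", "positive": "p3", "negative": "n4"},
]
kept = deduplicate(rows)  # keeps rows 0 and 3
```

Note this drops data; an alternative is to keep all triplets and rely on a string-level batch sampler so duplicates simply never share a batch.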