svjack opened this issue 3 years ago
Hi @svjack
For (a_i, p_j) with i != j and i, j random: the probability that this is a positive pair is rather small. There are 500k queries, so the chance of selecting another one that is actually a positive is greater than 0, but it is so small that it does not affect the training.
So yes, some batches will violate this constraint, but the number is too small to have any impact.
This is specific to MS MARCO. If you have another dataset where there is a high chance that a random a_i and a random p_j are actually a positive pair, you have to implement a collate / sampler function that ensures this does not happen.
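For illustration, such a check at the collate / sampler level could look roughly like this; the (query_id, passage_id, example) batch layout and the query_to_positives lookup are assumptions of this sketch, not part of sentence-transformers:

def filter_false_negatives(batch, query_to_positives):
    """Drop examples whose passage is a known positive of another query already kept for this batch."""
    kept = []  # list of (query_id, passage_id, example) tuples kept so far
    for query_id, passage_id, example in batch:
        # Reject the example if its passage is a positive of an earlier query,
        # or an earlier passage is a positive of this query.
        clash = any(
            passage_id in query_to_positives[other_query]
            or other_passage in query_to_positives[query_id]
            for other_query, other_passage, _ in kept
        )
        if not clash:
            kept.append((query_id, passage_id, example))
    return [example for _, _, example in kept]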
thanks for your reply.
If I override the smart_batching_collate and smart_batching_collate_text_only methods of the SentenceTransformer class to filter out samples where (a_i, p_j) with i != j is actually a positive pair, the output {'features': features, 'labels': torch.stack(labels)} becomes smaller than the input "batch" parameter, so the iterator will yield batches of varying size. Is this change compatible with the rest of the source code, or is there any part of the code I should pay attention to because of it?
And since the labels in multiple_negatives_ranking_loss of MultipleNegativesRankingLoss are built from the number of scores and therefore handle a varying batch size, I think this change may be the simplest way.
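For reference, the in-batch-negatives loss derives its labels from however many scores it actually computes, roughly along these lines (a simplified sketch of the idea, assuming already-normalized embeddings; not the library's exact code):

import torch
import torch.nn.functional as F

def in_batch_negatives_loss(anchor_emb, positive_emb, scale=20.0):
    # Score every anchor against every positive in the batch; the diagonal
    # entries correspond to the true (a_i, p_i) pairs.
    scores = anchor_emb @ positive_emb.T * scale
    # Label i means "anchor i matches positive i", so the labels adapt to
    # whatever batch size the collate function actually returned.
    labels = torch.arange(scores.size(0), device=scores.device)
    return F.cross_entropy(scores, labels)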
You can use a batch sampler in your DataLoader to ensure that a batch doesn't have two samples with the same label
I tried this yesterday. Do you mean using sampler.data_source to locate the samples that will be yielded, so the constraint can be enforced there?
This approach may be hard to debug outside of model.fit, because the samples in a batch do not have the same length until the model's smart_batching_collate function pads them.
So should I debug inside model.fit, or in some other piece of code?
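One way to inspect the collated batches without stepping into model.fit is to attach the model's collate function to the DataLoader manually and pull a single batch; a minimal sketch, assuming train_dataloader and model already exist:

# model.fit attaches the collate function itself during training; doing it
# here lets you look at one padded batch outside of any training loop.
train_dataloader.collate_fn = model.smart_batching_collate
first_batch = next(iter(train_dataloader))
print(type(first_batch))  # inspect the collated structure and tensor shapes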
Have a look here for an intro to batch samplers in PyTorch: https://pytorch.org/docs/stable/data.html#disable-automatic-batching
A class I used looked like this:
import math
import random
import gzip

class NoSameLabelsBatchSampler:
    def __init__(self, dataset, batch_size):
        self.dataset = dataset
        self.idx_org = list(range(len(dataset)))
        random.shuffle(self.idx_org)
        self.idx_copy = self.idx_org.copy()
        self.batch_size = batch_size

    def __iter__(self):
        batch = []
        labels = set()
        num_miss = 0
        num_batches_returned = 0

        while num_batches_returned < self.__len__():
            if len(self.idx_copy) == 0:
                random.shuffle(self.idx_org)
                self.idx_copy = self.idx_org.copy()

            idx = self.idx_copy.pop(0)
            label = self.dataset[idx][1].cpu().tolist()

            if label not in labels:
                num_miss = 0
                batch.append(idx)
                labels.add(label)

                if len(batch) == self.batch_size:
                    yield batch
                    batch = []
                    labels = set()
                    num_batches_returned += 1
            else:
                num_miss += 1
                self.idx_copy.append(idx)  # Add item again to the end

                if num_miss >= len(self.idx_copy):  # Too many failures, flush idx_copy and start with a clean copy
                    self.idx_copy = []

    def __len__(self):
        return math.ceil(len(self.dataset) / self.batch_size)
There is no need to change anything internally in SentenceTransformer. Just adding a batch sampler to your DataLoader is sufficient.
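For example, wiring the sampler into training could look roughly like this; train_dataset (whose items expose a label tensor at index 1, as the sampler above expects), train_loss, and model are placeholders for this sketch:

from torch.utils.data import DataLoader

# The batch sampler decides which indices form a batch; model.fit sets its
# own collate function on the DataLoader, so nothing inside
# SentenceTransformer needs to change.
batch_sampler = NoSameLabelsBatchSampler(train_dataset, batch_size=32)
train_dataloader = DataLoader(train_dataset, batch_sampler=batch_sampler)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)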
As the note in MultipleNegativesRankingLoss.py says, "You can also provide one or multiple hard negatives per anchor-positive pair by structuring the data like this: (a_1, p_1, n_1), (a_2, p_2, n_2)", it seems that (a_i, p_j) should be a negative pair within a batch when i != j. I reviewed the definition of TripletsDataset: its __iter__ method only defines a single sample, and there is no collate function that confirms the negative relation between the query_text of sample i and the pos_text of sample j for i != j within one batch. Can you explain this to me?