UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

A problem about data preparation in ms_marco train_bi-encoder.py #649

Open svjack opened 3 years ago

svjack commented 3 years ago

As the note in MultipleNegativesRankingLoss.py says: "You can also provide one or multiple hard negatives per anchor-positive pair by structuring the data like this: (a_1, p_1, n_1), (a_2, p_2, n_2)

    Here, n_1 is a hard negative for (a_1, p_1). The loss will use for the pair (a_i, p_i) all p_j (j != i) and all n_j as negatives."

It seems that (a_i, p_j) should be a negative pair within a batch whenever i != j. I reviewed the definition of TripletsDataset: its __iter__ method only defines a single sample, and there is no collate function that confirms the negative relation between the query_text from sample i and the pos_text from sample j (i != j) within one batch. Can you explain this to me?

nreimers commented 3 years ago

Hi @svjack For (a_i, p_j) with i != j and i, j random: the probability that this is a positive pair is rather small. There are 500k queries, so the chance of selecting another one that is actually a positive is greater than 0, but it is so small that it does not affect the training.

So yes, some batches will violate this constraint, but the amount is too small to have any impact.

This is specific to MS MARCO. If you have another dataset where there is a high chance that a random a_i and a random p_j are actually a positive pair, you have to implement a collate / sampler function that ensures this does not happen.
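As a rough back-of-the-envelope check on that claim (the numbers below are assumptions for illustration, not measurements): if a fraction f of randomly drawn passages happens to be relevant to a given query, each of the B-1 in-batch "negatives" is a false negative with probability about f.

# Rough estimate of accidental false negatives per batch (assumed numbers)
batch_size = 64              # B: anchor-positive pairs per batch
frac_relevant = 1e-5         # f: assumed chance a random passage answers a given query
expected_false_negatives = (batch_size - 1) * frac_relevant
print(expected_false_negatives)   # ~0.00063, i.e. roughly one contaminated pair every ~1600 batches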

svjack commented 3 years ago

> For (a_i, p_j) with i != j and i, j random: the probability that this is a positive pair is rather small. [...]

Thanks for your reply.

svjack commented 3 years ago

> For (a_i, p_j) with i != j and i, j random: the probability that this is a positive pair is rather small. [...]

If I override the smart_batching_collate and smart_batching_collate_text_only methods of the SentenceTransformer class to filter out samples where (a_i, p_j) with i != j is actually a positive pair, the output {'features': features, 'labels': torch.stack(labels)} will be smaller than the "batch" input parameter, so the iterator will yield batches of varying size. Is this change compatible with the source code, or is there any part of the code I should pay attention to because of it?

svjack commented 3 years ago

And because the labels in multiple_negatives_ranking_loss of MultipleNegativesRankingLoss are built to match whatever size the scores matrix has, I think this change may be the simpler way.
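For context on that last point: the in-batch targets of MultipleNegativesRankingLoss are derived from whatever batch size actually arrives, roughly like this (a simplified sketch of the idea, not the library's exact code):

import torch
from torch import nn

def mnrl_sketch(anchor_emb, positive_emb, scale=20.0):
    # anchor_emb, positive_emb: (batch_size, dim) tensors from the two encoder passes
    a = nn.functional.normalize(anchor_emb, dim=-1)
    p = nn.functional.normalize(positive_emb, dim=-1)
    scores = a @ p.T * scale                               # (batch_size, batch_size) similarity matrix
    # The true positive for anchor i sits on the diagonal, so the target is just
    # 0..batch_size-1 -- it adapts automatically if a batch comes in smaller.
    labels = torch.arange(scores.size(0), device=scores.device)
    return nn.functional.cross_entropy(scores, labels)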

nreimers commented 3 years ago

You can use a batch sampler in your DataLoader to ensure that a batch doesn't have two samples with the same label

svjack commented 3 years ago

> You can use a batch sampler in your DataLoader to ensure that a batch doesn't have two samples with the same label

I tried this yesterday. Do you mean using sampler.data_source to locate the samples that will be yielded, in order to enforce the constraint? This may be difficult to debug outside of model.fit, because the samples in a batch do not have the same length until the model's smart_batching_collate function pads them. So should I debug inside model.fit, or in some other piece of code?

nreimers commented 3 years ago

Have a look here for an intro to batch sampler in pytorch: https://pytorch.org/docs/stable/data.html#disable-automatic-batching

A class I used looked like this:

import math
import random

class NoSameLabelsBatchSampler:
    #Yields batches of dataset indices in which no two samples share the same label
    def __init__(self, dataset, batch_size):
        self.dataset = dataset
        self.idx_org = list(range(len(dataset)))
        random.shuffle(self.idx_org)
        self.idx_copy = self.idx_org.copy()
        self.batch_size = batch_size

    def __iter__(self):
        batch = []
        labels = set()
        num_miss = 0

        num_batches_returned = 0
        while num_batches_returned < self.__len__():
            if len(self.idx_copy) == 0:
                random.shuffle(self.idx_org)
                self.idx_copy = self.idx_org.copy()

            idx = self.idx_copy.pop(0)
            label = self.dataset[idx][1].cpu().tolist()
            if label not in labels:
                num_miss = 0
                batch.append(idx)
                labels.add(label)
                if len(batch) == self.batch_size:
                    yield batch
                    batch = []
                    labels = set()
                    num_batches_returned += 1
            else:
                num_miss += 1
                self.idx_copy.append(idx) #Add item again to the end

                if num_miss >= len(self.idx_copy): #Too many failures, flush idx_copy and start clean
                    self.idx_copy = []

    def __len__(self):
        return math.ceil(len(self.dataset) / self.batch_size)

There is no need to change anything inside SentenceTransformer. Just adding a batch sampler to your data loader is sufficient.
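One possible way to wire the sampler in (a sketch under the assumption that your dataset items are (features, label) pairs, as the sampler above expects; train_dataset, train_loss and model are placeholders for your own objects):

from torch.utils.data import DataLoader

# batch_sampler replaces batch_size/shuffle/drop_last; model.fit still attaches its own
# smart_batching_collate to the dataloader, so no other changes are needed.
batch_sampler = NoSameLabelsBatchSampler(train_dataset, batch_size=64)
train_dataloader = DataLoader(train_dataset, batch_sampler=batch_sampler)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=1000)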