UKPLab / sentence-transformers

Multilingual Sentence & Image Embeddings with BERT
https://www.SBERT.net
Apache License 2.0

sentence-transformers for large-scale search #332

Open MichalPitr opened 4 years ago

MichalPitr commented 4 years ago

Hi, thanks for this fantastic repo and its documentation!

I have a question: I am working on a research project on fact-verification in Czech and as the first step we are trying various approaches to document retrieval. Our corpus is Czech Wikipedia abstracts and we have a dataset of claim-Wikipedia ID pairs.

I've split my Wikipedia abstracts into sentences and have been trying to use sentence-transformers to get meaningful embeddings and do top-k search in the embedding space. I've experimented with mBERT embeddings, which gave me pretty underwhelming results (around 3 times worse than BM25). I tried training an xlm-roberta with the make_multilingual.py script (teacher-student) on Czech TED parallel data (I also tried OPUS with no real gains), but it performed worse than base mBERT with a mean pooling layer.

The metric I use is a modified precision@k, such that each claim has 1 wiki_id and k = 10. (I've tried running mBERT for k up to 50, which increased precision from 0.09@k=10 to 0.16@k=50; BM25 without much pre-processing nets 0.3@k=10.)
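
In other words, the check is essentially hits@k over the claim-Wikipedia ID pairs. A minimal sketch of such an evaluation (the explicit mBERT + mean-pooling setup and the toy data are illustrative assumptions, not the original pipeline):

import numpy as np
from sentence_transformers import SentenceTransformer, models

# mBERT with a mean pooling layer, as described above
word = models.Transformer('bert-base-multilingual-cased')
pooling = models.Pooling(word.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word, pooling])

corpus_texts = ["Praha je hlavní město České republiky.", "Brno je město na Moravě."]
corpus_ids   = ["Praha", "Brno"]                  # wiki_id of each indexed sentence
claims       = ["Praha je hlavním městem Česka."]
gold_ids     = ["Praha"]                          # the single relevant wiki_id per claim

corpus_emb = np.asarray(model.encode(corpus_texts))
claim_emb  = np.asarray(model.encode(claims))

# cosine similarity via normalized dot product
corpus_emb /= np.linalg.norm(corpus_emb, axis=1, keepdims=True)
claim_emb  /= np.linalg.norm(claim_emb, axis=1, keepdims=True)
scores = claim_emb @ corpus_emb.T

k = min(10, len(corpus_texts))
hits = 0
for row, gold in zip(scores, gold_ids):
    top_idx = np.argsort(-row)[:k]
    hits += int(any(corpus_ids[j] == gold for j in top_idx))
print(f"precision@{k}: {hits / len(claims):.3f}")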

1) Does it make sense for xlm-roberta to perform worse after teacher-student training than mBERT without any?
2) Do you suppose extracting embeddings from transformers can work for large-scale IR, or would I need to get more creative with the pre-training tasks, e.g. https://arxiv.org/pdf/2002.03932.pdf ?

Appreciate any response!

nreimers commented 4 years ago

Hi @MichalPitr I think the current models are not the best / not well suited for this task. They were trained on the sentence level; for retrieval, however, you usually want to index paragraphs.

Further, the current models are rather recall-oriented, i.e., they have a low chance of missing something. For IR, you usually want precision-oriented models, like BM25.

We currently plan to soon release several examples (+ pre-trained models) for information retrieval. The models we have so far already beat BM25 by quite a margin (on English datasets like MS MARCO). However, they can still be made better 👍

If you have suitable training data in the format (query, relevant_passage), I recommend having a look at this: https://github.com/UKPLab/sentence-transformers/blob/master/sentence_transformers/losses/MultipleNegativesRankingLoss.py

So far this has given us the best performance when training models for IR.
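
Until those examples are released, a minimal training sketch with MultipleNegativesRankingLoss on (query, relevant_passage) pairs might look as follows (the model name and the two toy pairs are illustrative assumptions):

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, SentencesDataset, losses
from sentence_transformers.readers import InputExample

model = SentenceTransformer('distiluse-base-multilingual-cased')

# (query, relevant_passage) pairs; the passages of the other queries in a batch act as random negatives
train_examples = [
    InputExample(texts=['what is the capital of czechia', 'Prague is the capital of the Czech Republic.']),
    InputExample(texts=['who wrote the good soldier svejk', 'The Good Soldier Svejk is a novel by Jaroslav Hasek.']),
]

train_dataset = SentencesDataset(train_examples, model)
train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=2)  # larger batches usually help
train_loss = losses.MultipleNegativesRankingLoss(model=model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)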

Current timeline:

Best, Nils Reimers

MichalPitr commented 4 years ago

@nreimers Many thanks for the reply.

I'll be looking forward to the examples and will have a go at the suggested loss function with our data.

If I may ask, have you tried playing around with pre-training tasks other than MLM and NSP, such as the Inverse Cloze Task, before doing domain-specific training?

Regards, Michal

nreimers commented 4 years ago

Hi @MichalPitr Not yet.

In this paper: https://arxiv.org/abs/2002.03932

they report quite good results with the Inverse Cloze Task, much better than with masked language modeling.

So I think it will be worthwhile to test this, especially as the implementation is straightforward.
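
For illustration, building Inverse Cloze Task training pairs amounts to treating one sentence of a passage as a pseudo-query and the rest of the passage as its positive context; the resulting pairs can then be fed to e.g. MultipleNegativesRankingLoss. A rough sketch, not library functionality (the keep-probability follows the ORQA-style setup):

import random
from sentence_transformers.readers import InputExample

def inverse_cloze_examples(passages, keep_sentence_prob=0.1, seed=42):
    """Build (pseudo-query, context) pairs for the Inverse Cloze Task.

    passages is a list of passages, each given as a list of sentences. A random sentence
    is used as the query; the remaining sentences form the positive context. With a small
    probability the sentence is kept in the context so the model also sees lexical overlap.
    """
    rng = random.Random(seed)
    examples = []
    for sentences in passages:
        if len(sentences) < 2:
            continue
        idx = rng.randrange(len(sentences))
        query = sentences[idx]
        if rng.random() < keep_sentence_prob:
            context = " ".join(sentences)                                # keep the sentence
        else:
            context = " ".join(sentences[:idx] + sentences[idx + 1:])    # remove the sentence
        examples.append(InputExample(texts=[query, context]))
    return examples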

In September, a new Ph.D. student will join my team and will work on this (and related topics). We hope that we can then publish better pre-training strategies.

Best, Nils Reimers

pommedeterresautee commented 3 years ago

Hi @nreimers, in the Dense Passage Retrieval paper they show that adding one hard negative per batch on top of the random in-batch negatives (MultipleNegativesRankingLoss) is the best approach to get high recall. Is there a way to do the same thing in the current library?

A second point is that they use a two-tower architecture; is it possible to reproduce such an architecture in sentence-transformers?

thakur-nandan commented 3 years ago

Hi @pommedeterresautee, it is currently not possible out of the box with the library. But with the small changes to the existing code shown below, it becomes possible: embeddings_c contains the hard negatives corresponding to the given questions, i.e. c[i] is the hard negative for anchor a[i].

You will need to create training examples as triplets, i.e. providing anchor, positive and hard-negative texts:

from sentence_transformers.readers import InputExample

examples = []
# anchor, positive and hard_negative are plain strings
examples.append(InputExample(texts=[anchor, positive, hard_negative]))

Updated MultipleNegativesRankingLoss class:

import torch
from torch import nn, Tensor
from typing import Iterable, Dict
from sentence_transformers import SentenceTransformer

class MultipleNegativesRankingLoss(nn.Module):
    """
        This loss expects as input a batch consisting of sentence pairs (a_1, b_1), (a_2, b_2)..., (a_n, b_n)
        where we assume that (a_i, b_i) are a positive pair and (a_i, b_j) for i!=j a negative pair.
        For each a_i, it uses all other b_j as negative samples, i.e., for a_i, we have 1 positive example (b_i) and
        n-1 negative examples (b_j). It then minimizes the negative log-likelihood for softmax normalized scores.
        This loss function works great to train embeddings for retrieval setups where you have positive pairs (e.g. (query, relevant_doc))
        as it will sample in each batch n-1 negative docs randomly.
        The performance usually increases with increasing batch sizes.
        For more information, see: https://arxiv.org/pdf/1705.00652.pdf
        (Efficient Natural Language Response Suggestion for Smart Reply, Section 4.4)
        The error function is equivalent to::
            scores = torch.matmul(embeddings_a, embeddings_b.t())
            labels = torch.tensor(range(len(scores)), dtype=torch.long).to(self.model.device) #Example a[i] should match with b[i]
            cross_entropy_loss = nn.CrossEntropyLoss()
            return cross_entropy_loss(scores, labels)
        Example::
            from sentence_transformers import SentenceTransformer,  SentencesDataset, LoggingHandler, losses
            from sentence_transformers.readers import InputExample
            model = SentenceTransformer('distilbert-base-nli-mean-tokens')
            train_examples = [InputExample(texts=['Anchor 1', 'Positive 1']),
                InputExample(texts=['Anchor 2', 'Positive 2'])]
            train_dataset = SentencesDataset(train_examples, model)
            train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=train_batch_size)
            train_loss = losses.MultipleNegativesRankingLoss(model=model)
    """
    def __init__(self, model: SentenceTransformer):
        super(MultipleNegativesRankingLoss, self).__init__()
        self.model = model

    def forward(self, sentence_features: Iterable[Dict[str, Tensor]], labels: Tensor):
        # sentence_features holds one feature dict per input column: anchors (a), positives (b), hard negatives (c)
        reps = [self.model(sentence_feature)['sentence_embedding'] for sentence_feature in sentence_features]
        reps_a, reps_b, reps_c = reps
        return self.multiple_negatives_ranking_loss(reps_a, reps_b, reps_c)

    def multiple_negatives_ranking_loss(self, embeddings_a: Tensor, embeddings_b: Tensor, embeddings_c: Tensor):
        """
        :param embeddings_a:
            Tensor of shape (batch_size, embedding_dim)
        :param embeddings_b:
            Tensor of shape (batch_size, embedding_dim)
        :param embeddings_c:
            Tensor of shape (batch_size, embedding_dim)
        :return:
            The scalar loss
        """
        # random_neg_scores[i][j] = sim(a_i, b_j): the diagonal holds the positive pairs,
        # the off-diagonal entries serve as in-batch (random) negatives
        random_neg_scores = torch.matmul(embeddings_a, embeddings_b.t())
        # hard_neg_scores[i] = sim(a_i, c_i), appended as one extra negative column per example
        hard_neg_scores = torch.unsqueeze(torch.diag(torch.matmul(embeddings_a, embeddings_c.t())), 1)
        # scores has shape (batch_size, batch_size + 1); the correct class for a[i] is still column i (its positive b[i])
        scores = torch.cat((random_neg_scores, hard_neg_scores), 1)
        labels = torch.tensor(range(len(scores)), dtype=torch.long).to(self.model.device)
        cross_entropy_loss = nn.CrossEntropyLoss()
        return cross_entropy_loss(scores, labels)
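
A possible way to train with this modified loss, assuming the class above is saved in your own module (the module name hard_negative_loss.py, the model choice and the toy triplets are illustrative assumptions):

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, SentencesDataset
from sentence_transformers.readers import InputExample
from hard_negative_loss import MultipleNegativesRankingLoss  # hypothetical module holding the class above

model = SentenceTransformer('distilbert-base-nli-mean-tokens')

# each example is an (anchor, positive, hard negative) triplet
train_examples = [
    InputExample(texts=['what is the capital of france',
                        'Paris is the capital of France.',
                        'Lyon is a large city in France.']),
    InputExample(texts=['who painted the mona lisa',
                        'The Mona Lisa was painted by Leonardo da Vinci.',
                        'The Mona Lisa hangs in the Louvre in Paris.']),
]

train_dataset = SentencesDataset(train_examples, model)
train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=2)
train_loss = MultipleNegativesRankingLoss(model=model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)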

Kind Regards, Nandan Thakur

pommedeterresautee commented 3 years ago

Thank you for the super rapid answer. Obviously you are working on it. Did you see any significant improvement in your tests? Also, are there other strategies you are working on that significantly improve embedding quality for question-answering tasks (like MS MARCO, etc.)?

nreimers commented 3 years ago

A second point is that they use a two-tower architecture; is it possible to reproduce such an architecture in sentence-transformers?

Sadly not yet; so far the same architecture is used for both types of input. But I am working on enabling it.

pommedeterresautee commented 3 years ago

In the paper they don't compare single- and two-tower architectures. I am wondering if two towers are that useful: there are usually 16 attention heads per transformer model, so we can imagine that during fine-tuning on a QA task some specialize on questions and others on answers (plus, from a pragmatic point of view, two models take more space in memory, implying smaller batches and therefore fewer random negatives per batch). Anyway, I can't wait for your conclusions! I will test hard-negative learning tomorrow and report here.

thakur-nandan commented 3 years ago

Thank you for the super rapid answer. Obviously you are working on it. Did you see any significant improvement in your tests? Also, are there other strategies you are working on that significantly improve embedding quality for question-answering tasks (like MS MARCO, etc.)?

I am currently conducting experiments on the MS MARCO passage retrieval dataset, where I find that random negatives with MultipleNegativesRankingLoss and a big batch size (>28) produce good scores.

Adding hard negatives leads to a minimal improvement in performance, with the downside of requiring additional GPU memory (you'll have extra encodings per batch for the hard negatives). If memory is not an issue, I would advise adding one hard negative per batch with the technique suggested above.

Regarding other strategies, I've only tried TripletLoss, which often leads to worse performance. I would suggest random negatives with MultipleNegativesRankingLoss, which works quite well by default. Even for NQ and SQuAD v1.1 we have seen good results with it.

pommedeterresautee commented 3 years ago

I've trained two models overnight, one with the one hard negative and one without... I got an R@1 difference of 1.3 absolute points, not nothing, but clearly a lot less than what was reported in the DPR paper. Something I learned from doing transformer-based reranking is that it's quite hard to find the right hard negatives. In my case, they are mined using TF-IDF; however, I filtered out examples that are too close (too-high scores) and applied other rules specific to my dataset to avoid negatives that are too hard. Regarding the number of tokens... on MS MARCO it's usually said that you should keep 512 tokens so as not to truncate any example. With 12 GB of memory, would you recommend limiting the number of tokens during training to get the largest batch, and then inferring with 512 tokens?
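
For reference, TF-IDF mining with a cap on the similarity score, as described above, could look roughly like this (the scikit-learn approach and the thresholds are illustrative assumptions, not the exact setup used):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def mine_hard_negatives(queries, positives, corpus, max_sim=0.9, top_k=1):
    """For each query, return the highest-scoring corpus passages that are lexically
    similar but not too similar (score <= max_sim) and not the gold positive."""
    vectorizer = TfidfVectorizer()
    corpus_vecs = vectorizer.fit_transform(corpus)
    query_vecs = vectorizer.transform(queries)
    sims = cosine_similarity(query_vecs, corpus_vecs)   # shape (num_queries, corpus_size)

    hard_negatives = []
    for i, query_sims in enumerate(sims):
        ranked = query_sims.argsort()[::-1]             # best-matching passages first
        negs = [corpus[j] for j in ranked
                if corpus[j] != positives[i] and query_sims[j] <= max_sim][:top_k]
        hard_negatives.append(negs)
    return hard_negatives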

thakur-nandan commented 3 years ago

@pommedeterresautee, yes, I have also found that sentence-transformers models have difficulties learning from hard negatives and often perform worse if you give too many hard negatives in the same batch.

I'm not sure what would work best; you could try 512 tokens with smaller batch sizes. I am experimenting with a sequence length of 256 tokens, a batch size of 28 and MultipleNegativesRankingLoss (random negatives), which takes around 11-12 GB and gives good performance on the MS MARCO passage dev dataset.
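
To make the memory trade-off concrete, the sequence length can be capped for training and raised again for inference; a minimal sketch (the model name is an assumption, the values are the ones discussed above):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('distilbert-base-nli-mean-tokens')

# train with shorter sequences so a larger batch fits into ~12 GB
model.max_seq_length = 256
# ... model.fit(...) as in the snippets above ...

# at inference time, encode with up to 512 tokens if the passages need it
model.max_seq_length = 512
passage_embeddings = model.encode(["A long passage that would otherwise be truncated ..."])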

Kind Regards, Nandan