How to reproduce ms marco dense embeddings models?

pommedeterresautee commented 3 years ago

Hi,

Recently, you have pushed by a large margin the MRR score of Ms Marco dense embeddings scores. There is a script to train the model in this repo (https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/ms_marco/train_bi-encoder.py) and I am wondering if there are other things you use to push results?

I am asking as I work on a large private dataset (700K q/a), I am using the scripts here (bi and cross), cross encoder provides an accuracy score 12 points better than bi-encoder. In end to end setup, Elasticsearch (no boost, no trick, just stemming) + cross encoding provides much better results than dense embedding + cross encoding (with brute force search, so no issue with approx search).

I have tried tricks to increase the batch size of the bi-encoder (24Gb GPU RAM), for instance with "gradient checkpointing", I can go from 48 to 256 examples per batch (btw it would be a nice add it to the bi encoder script, training will be around 20-30% slower but who cares :-) ), it provides some improvement (+3 points of classification accuracy), but it does not seem enough to come close to cross encoding results. I am quite sure that the new release of deepspeed can help to increase batch size, probably around 400 something examples by off loading stuff to the host memory, but I am also quite sure that it won't be enough to provide what is required to be useful in end to end setup. Same with playing with the learning rate, adam hyper params, etc.

XP of the last 10 days (with different ratio pos / neg and different batch size, LR, etc.), group on the top are cross encoders, group on the bottom are bi encoders. Best bi encoder (pink) is the one with gradient checkpointing and very large batch size.

In rocketqa paper, they list some tricks, regarding negative examples generation: generate negative with dense embedding instead of BM25 (they say the same in ANCE paper) and filtering too good negative with cross encoding, are you using some of those?

Do you apply some tricks on the dimension reduction like applying layer norm at the end like in https://arxiv.org/pdf/2012.15156.pdf / https://arxiv.org/pdf/2005.00181.pdf?

In my case, too good negative are filtered by rules (does the question appear in the title / text, etc.), filtering with cross encoder changes almost nothing to both cross encoder and bi encoder models (before we used those strict rules, filtering with a cross encoder model had a large impact on scores, and now it has almost no effect on accuracy score, so the rules are probably good enough).

Also I noticed that you are using a 1 positive for 4 negative examples on cross encoder but not for bi-encoder, have you noticed a difference in behavior? (on rocket qa they are using 1-4 ratio for dense embeddings).

So you got the idea, is there some secret sauce?

FWIW

...
    if model_args.bi_encoder:
        word_embedding_model = models.Transformer(
            model_args.model_name_or_path,
            max_seq_length=model_args.seq_len,
            model_args={"gradient_checkpointing": model_args.gradient_checkpointing},
        )
        pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
        dense_model = models.Dense(
            in_features=pooling_model.get_sentence_embedding_dimension(),
            out_features=256,
            activation_function=nn.Tanh(),
        )
        model = SentenceTransformer(modules=[word_embedding_model, pooling_model, dense_model])
        train_loss = losses.MultipleNegativesRankingLoss(model=model)
        wandb.watch(model)

        model.fit(
            train_objectives=[(train_dataloader, train_loss)],
            evaluator=evaluator,
            epochs=model_args.num_train_epochs,
            warmup_steps=model_args.warmup_steps,
            evaluation_steps=model_args.eval_steps,
            optimizer_params={"lr": model_args.learning_rate, "eps": 1e-6, "correct_bias": False},
            output_path=model_args.output_dir,
            use_amp=model_args.fp16,
        )
    else:
        model = CrossEncoder(model_args.model_name_or_path, num_labels=1, max_length=model_args.seq_len)

        wandb.watch(model.model)

        model.fit(
            train_dataloader=train_dataloader,
            evaluator=evaluator,
            epochs=model_args.num_train_epochs,
            evaluation_steps=model_args.eval_steps,
            optimizer_params={"lr": model_args.learning_rate, "eps": 1e-6, "correct_bias": False},
            warmup_steps=model_args.warmup_steps,
            output_path=model_args.output_dir,
            use_amp=model_args.fp16,
        )

nreimers commented 3 years ago

@pommedeterresautee The v2 models have all been trained with the same script. In my experiments I noticed that I get with BERT / DistilBERT better (sometimes far better) results than with RoBERTa / DistilRoBERTa.

I use the following way for training (hope I can update it soon): 1) MS MARCO provides triplets with hard negatives, which were mined using BM25. Similar to the RocketQA paper, I classify them using a cross-encoder. For each query, I use 20 samples that had a cross-encoder score below 0.1 2) I encoded 1 Million passages with a preliminary BiEncoder. For each train query, I retrieve the top 100 documents and classify them with the Bi-Encoder. I use the top 20 samples that have a cross-encoder score below 0.1. 3) I repeat step 2 with a different BiEncoder I had from a previous setup.

So for every query I have the positive passages and 60 hard negatives (20 from BM25, 40 from semantic search). All hard negatives have a cross-encoder score below 0.1

I then train with batch-size 75 and max_seq_len=350

Also I noticed that you are using a 1 positive for 4 negative examples on cross encoder but not for bi-encoder, have you noticed a difference in behavior? (on rocket qa they are using 1-4 ratio for dense embeddings). The bi-encoder has 1 hard negative and batch_size-1 + batch_size random negatives (the positives and hard negatives from all other triplets in a batch).

For NQ I tested to include 2 and 5 hard negatives for every query: For 2 hard negatives, I saw a slight improvement. For 5 hard negatives, I saw a slight drop. When adding more hard negatives in each batch, you have to update your batch size.

I am also looking forward to try DeepSpeed to be able to run it with larger batch sizes.

pommedeterresautee commented 3 years ago

Thank you @nreimers for your very complete answer. I will try with much more negative examples and mining semantically similar data points.

If I understand correctly, you have implemented most of the rocketQA stuff, do you think that the difference with the score from the paper is related to the batch size? the siamese architecture using 2 independent transformers at the same time? ERNIE model? Something else?

Regarding negative examples, do you put them in the same InputExample([anchor, pos, neg1, neg2, ...]) or multiple InputExample([anchor, pos, neg1]) ?

nreimers commented 3 years ago

Hi @pommedeterresautee Yes, I think from the paper the main difference is the batch size. As you see in figure 4, with a batch size of 128 you get a score of about 31, while with a batch size of 4096 you get a score of 36.

So having the model trained on multiple large GPUs appears quite beneficial.

I use this format to pass multiple hard negatives per query: InputExample([anchor, pos, neg1, neg2, ...])

pommedeterresautee commented 3 years ago

Thank you I am rewriting my pipelines to follow the format. You have probably seen it yesterday, but in case of : https://arxiv.org/pdf/2101.06983.pdf + https://github.com/luyug/GC-DPR + https://github.com/luyug/Reranker

The author claims to have trained DPR on a single 2080 GPU in 2 days (instead of 8 V100 in 1 day) with a very interesting gradient cache mechanism (a 2 pass mechanism, one pass without the back propagation to compute lots of representation and a second one where the same large batch is divided in smaller ones and the representations computed in the first pass are used as neg examples).

It may match quite well with sentence-transformers :-) (much better than deepspeed and gradient checkpointing)

Bonus, appendix A of the paper gives a nice description of the code!

nreimers commented 3 years ago

@pommedeterresautee Thanks for the nice reference. Hope I can implement the tricks they used.

pommedeterresautee commented 3 years ago

I have implemented the examples with multiple neg per positive and now understand your comment about updating the batch size.

Can you help me to understand those points:

you build batch of size 75, and each example has 60 negative examples, how does it match your GPU Ram? (are you doing some tricks like splitting the neg examples in several lists or using custom hardware with lots of RAM or ?)
in RocketQA paper but also in Fb one (https://arxiv.org/pdf/2006.11632.pdf) they say they use up to 2 (Fb) or 4 (Baidu) neg examples per positive examples, more decreases recall (according to Fb which works on its own data), have you noticed large improvement in putting 60 neg examples? In my XP, I have noticed that 20 neg examples vs 5 increases slightly scores, as much as training during 2 epochs with 5 examples, and at the end it's not that effective for the time spent to train
in Fb they insist on the importance to not take the best matching results as hard negative examples but to target around the 100s-150s position in the retrieved list, did you test that strategy?

For what it worth, training with multiple hard negative in a single Example (instead of repeating the positive example for each negative sample) had a nice effect (blue line, pink was the best previous bi encoder, and higher scores are related to cross encoders):

nreimers commented 3 years ago

you build batch of size 75, and each example has 60 negative examples, how does it match your GPU Ram? (are you doing some tricks like splitting the neg examples in several lists or using custom hardware with lots of RAM or ?) The batch size is 75 and for every (query, positive_passage) I use one out of 60 hard negatives. The 60 is just for the total number of available hard negatives I choose from.

in Fb they insist on the importance to not take the best matching results as hard negative examples but to target around the 100s-150s position in the retrieved list, did you test that strategy? I made similar experience, that hard negatives can sometime be positives. By using the the classification with the cross-encoder first, I remove the majority of relevant / matching passages. So In that case, I don't need to skip the top most similar items.

pommedeterresautee commented 3 years ago

Hi @nreimers,

You have recently updated the documentation regarding dense embedding models. It highlights the difference between symmetric and asymmetric models.

I wonder if by that you mean that to train QA models, you advise to use Asym class from https://github.com/UKPLab/sentence-transformers/blob/master/sentence_transformers/models/Asym.py ?

Is the difference with "classic" approach (same last layer for both query/paragraph) like in https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/ms_marco/train_bi-encoder.py is significant in score/quality?

nreimers commented 3 years ago

@pommedeterresautee No, I did not imply to train with Asym class. I added this to the docs to make it more clear if you have a symmetric (query & doc have the same amount of content) or if you have an asymmetric case (query is short, doc is longer).

Currently we evaluate what the best method is for the asymmetric case. But so far we don't have a conclusion. Once we have a conclusion, and know hot to train for asymmetric cases (short query, long doc), the docs will be updated with the respective recommendations.

djstrong commented 3 years ago

@pommedeterresautee Do you have sentence-transformers integrated with wandb?

pommedeterresautee commented 3 years ago

Yes, I think it works kind of out of the box

djstrong commented 3 years ago

Hmm, wandb.init and wandb.watch(model) haven't logged anything.

pommedeterresautee commented 3 years ago

You can log directly in wandb, it will require a custom measure on dev test

djstrong commented 3 years ago

Thank you! I have added wandb logging to EmbeddingSimilarityEvaluator: https://github.com/djstrong/sentence-transformers/tree/wandb if someone interested.

RobertHua96 commented 1 year ago

Hi @pommedeterresautee could you share how you implemented gradient checkpointing for ST?

pommedeterresautee commented 1 year ago

Hi, TBH we are not using anymore sentence embeddings, at the end we found out that - for our needs - ST adds little added value and makes everything (including gradient checkpointing) a bit more annoying. I have not anymore the code to do it. I would advise you to consider carefully if you need it.

pommedeterresautee commented 1 year ago

@RobertHua96 lucky you, by chance I found an old version of the project... and how we did for the gradient checkpointing!

    word_embedding_model = models.Transformer(
        args.input,
        max_seq_length=512,
    )
    word_embedding_model.auto_model.gradient_checkpointing_enable()
    assert word_embedding_model.auto_model.is_gradient_checkpointing
    pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
    dense_model = models.Dense(
        in_features=pooling_model.get_sentence_embedding_dimension(),
        out_features=256,
    )

RobertHua96 commented 1 year ago

Thank you so much @pommedeterresautee!

UKPLab / sentence-transformers

How to reproduce ms marco dense embeddings models? #705