UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

msmarco-MiniLM-L6-cos-v5 hyperparameters #1220

Open vjeronymo2 opened 3 years ago

vjeronymo2 commented 3 years ago

Hey Nils, can you share the hyperparameters for the msmarco-MiniLM-L6-cos-v5 model, and the script call as well? I assume you used train_bi-encoder_mnrl.py. I'm trying to reproduce it, but I'm not even close to the MRR@10 of 32.27 that you got on MS MARCO.

Thanks in advance!

nreimers commented 3 years ago

It was code similar to train_bi-encoder_mnrl.py, but a lot uglier, since it was research code and contains paths specific to my setup.
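For anyone trying to reproduce it, here is a minimal sketch of that style of training with the sentence-transformers API (bi-encoder trained with MultipleNegativesRankingLoss); the base checkpoint, batch size, epochs, and the toy training example below are placeholder assumptions, not the original settings:

```python
# Minimal sketch of MNRL bi-encoder training, roughly what train_bi-encoder_mnrl.py does.
# Base model, batch size, epochs and the toy example are placeholders, not the real config.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, models

word_embedding_model = models.Transformer("nreimers/MiniLM-L6-H384-uncased", max_seq_length=300)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(), pooling_mode="mean")
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# Each InputExample is (query, positive passage[, mined hard negative passage])
train_examples = [
    InputExample(texts=["what is python", "Python is a programming language ...", "A python is a large snake ..."]),
    # ... millions of MS MARCO triplets in the real setup
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=140, drop_last=True)
train_loss = losses.MultipleNegativesRankingLoss(model)  # in-batch negatives + optional hard negative

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=10,
    warmup_steps=1000,
    use_amp=True,
)
```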

Will test the script, but a performance similar to 32.27 should be achievable.

What performance did you get?

nreimers commented 3 years ago

I don't have all the original training parameters or scripts. But here is what I got from my Excel sheet with the results.

I think I first trained the MiniLM-L12 model with a batch size of 100. Then I extracted 6 layers and trained the smaller model with a batch size of 140. After that I did hard negative mining with cross-encoder denoising and continued training with the hard negatives specific to that model.
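As a rough illustration of the "extract 6 layers" step, something like the following can be done with transformers; which layers were actually kept and the exact starting checkpoint are assumptions here:

```python
# Hypothetical sketch: keep a subset of the 12 encoder layers of a MiniLM-L12
# checkpoint and save the result as a 6-layer model that can then be trained further.
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

model_name = "microsoft/MiniLM-L12-H384-uncased"  # placeholder for the L12 starting point
model = AutoModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

keep = [1, 3, 5, 7, 9, 11]  # e.g. every second layer; the real selection may differ
model.encoder.layer = nn.ModuleList([model.encoder.layer[i] for i in keep])
model.config.num_hidden_layers = len(keep)

model.save_pretrained("MiniLM-L6-extracted")
tokenizer.save_pretrained("MiniLM-L6-extracted")
```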

I currently have the published script running for the MiniLM-L6 model. Will see what the performance ends up being.

vjeronymo2 commented 3 years ago

I already ran your fine-tuned msmarco-MiniLM-L6-cos-v5 and got the same MRR@10 of 32.27 =) Training with train_bi-encoder_mnrl.py and the following parameters yielded an MRR@10 of 0.3040: --lr=1e-5 --warmup_steps=10000 --num_negs_per_system=10 --epochs=20 --train_batch_size 32 --accumulation_steps 4 (gradient accumulation) --negs_to_use=distilbert-margin_mse-sym_mnrl-mean-v1

I'll try using the default parameters and the batch size of 140 you specified.

vjeronymo2 commented 3 years ago

I want to get the parameters right before I train a multilingual MiniLM bi-encoder (I'm working with Rodrigo).

nreimers commented 3 years ago

@vjeronymo2 I just let it run with all the parameters in the file and got a performance of 32.87 after 4 epochs. The performance after 10 epochs is currently being evaluated.

Some notes: accumulation_steps does not really help. What matters is a large batch size: the larger, the better. Have a look at the picture here (and at the RocketQA paper): https://discuss.huggingface.co/t/train-the-best-sentence-embedding-model-ever-with-1b-training-pairs/7354

accumulation_steps cannot simulate larger batch sizes: With larger batch sizes the task difficulty increases, leading to better models. 32 is an extremely small batch size. Try to get it as large as possible.

Also regarding negs_to_use: For the first run, it is recommended to use all negatives from all systems. A more diverse set of negatives makes the model more robust.
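As a sketch of that idea, for each training query you can pool the mined negatives from every system instead of a single one (the dictionary layout below is only an assumed approximation of the hard-negatives data, not its exact format):

```python
import random

def gather_hard_negatives(query_entry, num_negs_per_system=5, negs_to_use=None):
    """query_entry["neg"] maps a system name to a list of mined negative passage ids."""
    systems = negs_to_use if negs_to_use is not None else list(query_entry["neg"].keys())
    negatives = []
    for system in systems:  # taking all systems gives a more diverse negative pool
        negatives.extend(query_entry["neg"][system][:num_negs_per_system])
    random.shuffle(negatives)
    return negatives
```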

Edit: After 10 epochs I get a performance of: 33.35 MRR@10

vjeronymo2 commented 2 years ago

Hi @nreimers, I read the paper and it's clear to me that increasing the batch size should lead to better models (up to a certain extent). But, to my still-developing knowledge, gradient accumulation should simulate larger batches. I tested this hypothesis by emulating a batch of 8192 during training and got much worse results than with a plain batch of 32. Why can't we use acc_steps? Is it something specific to MultipleNegativesRankingLoss?

nreimers commented 2 years ago

Hi @vjeronymo2, you use the in-batch examples as negatives.

It is like a multiple-choice test. With a batch size of 32, you are given a query and have to find the correct passage among 32 candidates.

With a batch size of 256, you are given a query but now have to find the correct passage among 256 candidates => a lot more difficult => a stronger training signal.

Having a batch size of 32 and doing grad_acc of 8 is like sequentially taking 8 multiple-choice tests, each with 32 candidate answers. That is no harder than taking a single multiple-choice test with 32 answer candidates.
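To make that concrete, here is a simplified re-implementation of the in-batch loss (for illustration only, not the library code), showing that eight accumulated batches of 32 never give a query more than 32 candidate passages:

```python
import torch
import torch.nn.functional as F

def in_batch_mnrl_loss(query_emb, passage_emb, scale=20.0):
    # query_emb, passage_emb: [batch_size, dim]; passage_emb[i] is the positive for query_emb[i]
    scores = scale * F.normalize(query_emb, dim=1) @ F.normalize(passage_emb, dim=1).T  # [B, B]
    labels = torch.arange(scores.size(0), device=scores.device)  # correct passage sits on the diagonal
    return F.cross_entropy(scores, labels)

q, p = torch.randn(256, 384), torch.randn(256, 384)

# One real batch of 256: each query must pick its passage out of 256 candidates.
loss_large = in_batch_mnrl_loss(q, p)

# Eight accumulated batches of 32: each query only ever sees 32 candidates.
loss_accum = sum(in_batch_mnrl_loss(q[i:i + 32], p[i:i + 32]) for i in range(0, 256, 32)) / 8
```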

vjeronymo2 commented 2 years ago

Ohh, I think I get it now. Thanks a lot for the explanation, Nils. I'll try a larger batch size on an A100 and post the results here.