UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

Reproducing msmarco-distilbert-dot-v5 training #1360

Open basilevancooten opened 2 years ago

basilevancooten commented 2 years ago

Hey there,

My team and I have been really impressed by the results reported for the msmarco-distilbert-dot-v5 model (HF card available here) on the MS MARCO passage dev set; they're quite astonishing!

I've been able to use the model for inference and obtained an MRR@10 similar to yours 😄. The next step for me is to reproduce the training of that model, so that I can replicate it with a different training set.
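For reference, the inference setup I used looks roughly like this (the query and passages are toy examples adapted from the model card):

```python
from sentence_transformers import SentenceTransformer, util

# Load the published checkpoint from the HF Hub
model = SentenceTransformer("sentence-transformers/msmarco-distilbert-dot-v5")

query_embedding = model.encode("How big is London?")
passage_embeddings = model.encode([
    "London has 9,787,426 inhabitants at the 2011 census.",
    "London is known for its financial district.",
])

# The -dot-v5 models are trained for dot-product similarity, not cosine
scores = util.dot_score(query_embedding, passage_embeddings)
print(scores)
```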

Following the script given on the HF model card here, I've stumbled upon two issues:

  1. The file msmarco-hard-negatives-v6.jsonl.gz seems to have been a local file that I can't find in the HF datasets. The closest I found was msmarco-hard-negatives, which comprises a score file (cross-encoder scores from a MiniLM-L-6-v2-based model for a set of (qid, pid) pairs, as a Dict[int, Dict[int, float]]) and a mined-negatives file (as a Dict[int, Dict[str, List[int]]]). From what I've figured out, the training script expects an intermediary mined-negatives file that carries the cross-encoder scores in the same structure, i.e. a combination of the two files above, so I manually combined both in order to run the script (see the sketch after this list; you can also tell me if I was misguided at this step, but it seems OK to me, so this point is basically resolved). => ✔️
  2. The script was apparently launched (cf. the last line of the script) with the argument --model final-models/distilbert-margin_mse-sym_mnrl-mean-v1. If I understand correctly, this means the script starts from a pretrained DistilBERT model, and I can't find that model on the model Hub or anywhere else. Is there any way for you to tell me how to get it? => ❌ 😢
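For completeness, the merge I did for point 1 looked roughly like this (file names are illustrative, and I'm assuming the mined-negatives file is JSONL with one query per line; the structures are the ones described above):

```python
import gzip
import json
import pickle

# Illustrative file names; the actual files come from the
# msmarco-hard-negatives dataset on the HF Hub.
with gzip.open("cross-encoder-scores.pkl.gz", "rb") as fin:
    ce_scores = pickle.load(fin)  # Dict[qid, Dict[pid, float]]

with gzip.open("mined-negatives.jsonl.gz", "rt") as fin, \
     gzip.open("hard-negatives-with-scores.jsonl.gz", "wt") as fout:
    for line in fin:
        entry = json.loads(line)  # one query: {"qid": ..., "neg": {system: [pid, ...]}, ...}
        qid = entry["qid"]
        # Replace each bare pid with a [pid, cross-encoder score] pair,
        # dropping pairs for which no score is available
        entry["neg"] = {
            system: [[pid, ce_scores[qid][pid]]
                     for pid in pids if pid in ce_scores.get(qid, {})]
            for system, pids in entry["neg"].items()
        }
        fout.write(json.dumps(entry) + "\n")
```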

Thank you in advance for your precious help,

Peace ☮️ 🤙

PS: also thank you for the amazing work you and your team have done on this library and congratulations on all the research results you've obtained so far.

nreimers commented 2 years ago

You can find a clean and nice version of the training here: https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/ms_marco/train_bi-encoder_margin-mse.py

It will produce a model with similar performance.
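In essence, the MarginMSE part of that script follows this pattern (a minimal sketch; the hyperparameters and the `triplets` iterable below are illustrative, not the script's actual defaults):

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, models

# Bi-encoder: transformer + mean pooling
word_embedding = models.Transformer("distilbert-base-uncased", max_seq_length=300)
pooling = models.Pooling(word_embedding.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embedding, pooling])

# `triplets` is a placeholder: an iterable of
# (query, pos_passage, ce_score_pos, neg_passage, ce_score_neg) tuples.
# The regression label is the cross-encoder margin CE(q, pos) - CE(q, neg).
train_samples = [
    InputExample(texts=[query, pos, neg], label=ce_pos - ce_neg)
    for query, pos, ce_pos, neg, ce_neg in triplets
]

train_dataloader = DataLoader(train_samples, shuffle=True, batch_size=64)
train_loss = losses.MarginMSELoss(model=model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=10,
    warmup_steps=1000,
    use_amp=True,
)
```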

Otherwise, for the specific model, training was done in two iterations:

1) Start with the distilbert-base-uncased model and train with MarginMSE + MultipleNegativesRankingLoss.
2) Use the model from 1) to mine hard negatives, and score them all with a cross-encoder (as sketched below).
3) Continue training that model with MarginMSE loss on those specific hard negatives.
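A rough sketch of step 2, i.e. mining negatives with the iteration-1 bi-encoder and scoring them with a cross-encoder (the model path, top_k, and the `queries`/`passages` containers are illustrative placeholders):

```python
from sentence_transformers import SentenceTransformer, CrossEncoder, util

bi_encoder = SentenceTransformer("path/to/iteration-1-model")  # model from step 1
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

# `passages`: Dict[pid, str]; `queries`: Dict[qid, str] (placeholders)
pids = list(passages.keys())
corpus_embeddings = bi_encoder.encode([passages[pid] for pid in pids],
                                      convert_to_tensor=True)

mined = {}  # Dict[qid, Dict[pid, float]]: candidate negatives with CE scores
for qid, query in queries.items():
    query_embedding = bi_encoder.encode(query, convert_to_tensor=True)
    # Retrieve the top-scoring passages with the bi-encoder
    # (dot product, since this model family is trained for it)
    hits = util.semantic_search(query_embedding, corpus_embeddings,
                                top_k=50, score_function=util.dot_score)[0]
    candidates = [pids[hit["corpus_id"]] for hit in hits]
    # Re-score every (query, passage) candidate pair with the cross-encoder
    ce_scores = cross_encoder.predict([(query, passages[pid]) for pid in candidates])
    mined[qid] = {pid: float(score) for pid, score in zip(candidates, ce_scores)}
```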

But the script linked above will produce a model that is on par.

basilevancooten commented 2 years ago

Thank you :) I'm going to try that out and I'll let you know ASAP.

basilevancooten commented 2 years ago

One last question, just so I'm sure I understood correctly: launching the script with the following arguments:

--model_name distilbert-base-uncased --lr=1e-5 --warmup_steps=10000 --negs_to_use=distilbert-margin_mse-sym_mnrl-mean-v1 --num_negs_per_system=10 --epochs=30 --name=cnt_with_mined_negs_mean --use_pre_trained_model --train_batch_size 64

should produce a model on par?

nreimers commented 2 years ago

No, you can just launch it with the default parameters and the distilbert-base-uncased model.

basilevancooten commented 2 years ago

Hey there, just to let you know: I relaunched the script with the default parameters and distilbert-base-uncased as the starting model, and obtained a model that reaches 0.356 MRR@10, which is good enough for now. I'll let you know if I'm able to get to 0.37. Cheers :)

Taosheng-ty commented 1 year ago

Same here. I followed the directions and used the default parameters with the distilbert-base-uncased model. However, I can only get 0.354. Any suggestions would be helpful.