UKPLab / sentence-transformers

Multilingual Sentence & Image Embeddings with BERT
https://www.SBERT.net
Apache License 2.0

Finetuning cross-encoder/ms-marco-MiniLM-L-6-v2 on my dataset #1353

Open thejaswiraya opened 2 years ago

thejaswiraya commented 2 years ago

I have been using the pre-trained cross-encoder/ms-marco-MiniLM-L-6-v2 on a dataset similar to MS MARCO for re-ranking paragraphs given a query/question. The top-3 accuracy has been quite good; I would now like to improve the top-1 accuracy.
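For reference, this is roughly the re-ranking setup I'm describing; the evaluation helper and the assumption that each query comes with a list of candidate paragraphs and a single gold paragraph are just for illustration, not part of my actual pipeline.

```python
from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2", max_length=512)

def top1_accuracy(samples):
    """samples: list of (query, candidate_paragraphs, gold_paragraph) tuples."""
    hits = 0
    for query, candidates, gold in samples:
        # Score every (query, paragraph) pair and take the highest-scoring paragraph
        scores = model.predict([[query, passage] for passage in candidates])
        hits += int(candidates[scores.argmax()] == gold)
    return hits / len(samples)
```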

I have a decent-sized dataset of questions and positive paragraphs collected from my application, and for these questions I also have several hard-negative paragraphs each. I intend to fine-tune cross-encoder/ms-marco-MiniLM-L-6-v2 on this dataset to improve top-1 accuracy. The dataset has several thousand question/positive-paragraph pairs.

For fine-tuning, my current thinking is to use knowledge distillation on my dataset, following the example provided in this repo based on the Sebastian Hofstätter et al. paper: https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/ms_marco/train_cross-encoder_kd.py. However, doing so would require me to train three large models (BERT-base, BERT-large, ALBERT-large) on my dataset, collect the logits, and then use those logits to fine-tune MiniLM (as done in train_cross-encoder_kd.py). This would be extremely time-consuming and compute-intensive.
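A much lighter variant I'm considering is to reuse a single existing cross-encoder as the teacher, score my (query, paragraph) pairs with it, and regress MiniLM onto those scores. The sketch below is only that simplification: the teacher model, the toy data, and the plain MSE-on-scores loss are my assumptions, not what train_cross-encoder_kd.py does exactly (the script distills from an ensemble of three teachers).

```python
import torch
from torch.utils.data import DataLoader
from sentence_transformers import CrossEncoder, InputExample

# Example teacher re-ranker; a stronger/custom teacher may need its own loading code.
teacher = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2")
student = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2", num_labels=1)

# (query, paragraph) pairs covering positives and hard negatives -- toy placeholders.
pairs = [
    ("who wrote hamlet", "Hamlet was written by William Shakespeare."),
    ("who wrote hamlet", "Macbeth is another tragedy by Shakespeare."),
]
teacher_scores = teacher.predict(pairs)

train_samples = [
    InputExample(texts=[query, paragraph], label=float(score))
    for (query, paragraph), score in zip(pairs, teacher_scores)
]
train_dataloader = DataLoader(train_samples, shuffle=True, batch_size=2)

student.fit(
    train_dataloader=train_dataloader,
    loss_fct=torch.nn.MSELoss(),  # regress the student onto the teacher scores
    epochs=1,
    warmup_steps=10,
    output_path="minilm-kd-finetuned",
)
```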

  1. Can I get the logits by running inference on my positive and hard-negative pairs with a different model, such as https://huggingface.co/sebastian-hofstaetter/distilbert-cat-margin_mse-T2-msmarco?
  2. Should I just fine-tune cross-encoder/ms-marco-MiniLM-L-6-v2 on my dataset with a binary cross-entropy loss?
  3. Is there a better alternative for fine-tuning cross-encoder/ms-marco-MiniLM-L-6-v2 on my dataset?

nreimers commented 2 years ago

1) Sadly, the DistilBERT model is also just a distilled version of the three large models. Also, DistilBERT is weaker than MiniLM, so this option would not really make sense.
2) Yes, that is the way I would opt for. You can use binary cross-entropy or implement a ListRankLoss to train your model.
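A minimal sketch of option 2, assuming the data is available as (question, paragraph, 0/1) triples: with num_labels=1 the CrossEncoder defaults to BCEWithLogitsLoss in fit, so binary labels for positives and hard negatives are all that is needed. The example data and hyperparameters below are placeholders.

```python
from torch.utils.data import DataLoader
from sentence_transformers import CrossEncoder, InputExample

# label 1.0 = relevant paragraph, 0.0 = hard negative -- toy placeholder examples
train_samples = [
    InputExample(texts=["who wrote hamlet", "Hamlet was written by William Shakespeare."], label=1.0),
    InputExample(texts=["who wrote hamlet", "Macbeth is another tragedy by Shakespeare."], label=0.0),
]
train_dataloader = DataLoader(train_samples, shuffle=True, batch_size=2)

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2", num_labels=1)
model.fit(
    train_dataloader=train_dataloader,  # default loss with num_labels=1 is BCEWithLogitsLoss
    epochs=1,
    warmup_steps=10,
    output_path="ms-marco-MiniLM-L-6-v2-finetuned",
)
```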

oekosheri commented 6 months ago

@nreimers Hi, does it make sense to fine-tune a bi-encoder model on a custom dataset, so that we won't need to run a cross-encoder to refine the ranking?
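For context, a minimal sketch of what fine-tuning a bi-encoder on such a dataset could look like (the base model, placeholder data, and hyperparameters are assumptions): (question, positive paragraph) pairs with MultipleNegativesRankingLoss, optionally adding a hard negative as a third text per example. Whether this removes the need for a cross-encoder re-ranker depends on the data; in practice the bi-encoder usually retrieves candidates and the cross-encoder still re-ranks them.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1")  # example base retriever

# (question, positive paragraph) pairs; other in-batch paragraphs act as negatives.
train_examples = [
    InputExample(texts=["who wrote hamlet", "Hamlet was written by William Shakespeare."]),
    InputExample(texts=["capital of france", "Paris is the capital and largest city of France."]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=10,
    output_path="bi-encoder-finetuned",
)
```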

amitguptadumka commented 5 months ago

@thejaswiraya were you able to fine-tune it in the end? If yes, can you share your observations?