thejaswiraya opened this issue 2 years ago
1) Sadly, the DistilBERT model is also just a distilled version of the three large models, and DistilBERT is weaker than MiniLM, so this option would not really make sense.
2) Yes, that is the way I would opt for. You can use binary cross-entropy (Binary CE) or implement a ListRankLoss to train your model.
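For concreteness, here is a minimal sketch of the binary cross-entropy option, assuming the classic CrossEncoder training API from sentence-transformers; the training pairs below are hypothetical placeholders:

```python
from torch.utils.data import DataLoader
from sentence_transformers import InputExample
from sentence_transformers.cross_encoder import CrossEncoder

# Hypothetical training pairs: label 1.0 for (query, positive paragraph),
# 0.0 for (query, hard-negative paragraph).
train_samples = [
    InputExample(texts=["what is a cross-encoder?",
                        "A cross-encoder scores a query and passage jointly."], label=1.0),
    InputExample(texts=["what is a cross-encoder?",
                        "The Eiffel Tower is located in Paris."], label=0.0),
]

# With num_labels=1, CrossEncoder.fit defaults to BCEWithLogitsLoss,
# i.e. binary cross-entropy over relevant/irrelevant pairs.
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2", num_labels=1)

train_dataloader = DataLoader(train_samples, shuffle=True, batch_size=16)
model.fit(
    train_dataloader=train_dataloader,
    epochs=1,
    warmup_steps=100,
    output_path="output/finetuned-reranker",
)
```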
@nreimers Hi, does it make sense to fine-tune a bi-encoder model on a customized dataset, so that we won't need to run a cross-encoder to refine the ranking?
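Not an authoritative answer, but for reference: fine-tuning a bi-encoder on (query, positive paragraph) pairs is typically done with MultipleNegativesRankingLoss. A minimal sketch, assuming the classic SentenceTransformer API; the base model and pairs are just examples:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Hypothetical (query, positive paragraph) pairs; in-batch negatives are
# created automatically by MultipleNegativesRankingLoss.
train_samples = [
    InputExample(texts=["what is a bi-encoder?",
                        "A bi-encoder embeds query and passage separately."]),
    InputExample(texts=["capital of france",
                        "Paris is the capital of France."]),
]

model = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1")
train_dataloader = DataLoader(train_samples, shuffle=True, batch_size=32)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=100,
    output_path="output/finetuned-biencoder",
)
```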
@thejaswiraya were you able to fine-tune in the end? If so, can you share your observations?
Does anyone have the paper that describes ListRankLoss? Is it this?
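I can't confirm which paper was meant, but losses of this kind usually trace back to the ListNet/ListMLE family (e.g. Cao et al., 2007, "Learning to Rank: From Pairwise Approach to Listwise Approach"). A minimal PyTorch sketch of the top-1 ListNet objective, as one common interpretation:

```python
import torch
import torch.nn.functional as F

def listnet_loss(pred_scores: torch.Tensor, true_scores: torch.Tensor) -> torch.Tensor:
    """Top-1 ListNet loss over one list of candidates.

    pred_scores: model scores for each candidate, shape (num_candidates,)
    true_scores: relevance labels (e.g. 1.0 positive, 0.0 negative), same shape
    """
    # Cross-entropy between the target top-1 distribution and the
    # model's top-1 distribution over the candidate list.
    true_dist = F.softmax(true_scores, dim=-1)
    log_pred_dist = F.log_softmax(pred_scores, dim=-1)
    return -(true_dist * log_pred_dist).sum()

# Example: one query with one positive and three negatives.
pred = torch.tensor([2.1, 0.3, -0.5, 0.9])
true = torch.tensor([1.0, 0.0, 0.0, 0.0])
loss = listnet_loss(pred, true)
```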
I have been using the pre-trained cross-encoder/ms-marco-MiniLM-L-6-v2 on a dataset similar to MS MARCO for re-ranking paragraphs based on a query/question. The top-3 accuracy results have been pretty good; I would like to improve top-1 accuracy.
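To make the setup concrete, re-ranking with this model looks roughly like the sketch below (the query and paragraphs are made up); top-1 accuracy is then just whether the highest-scoring paragraph is a labelled positive:

```python
import numpy as np
from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "how do cross-encoders re-rank passages?"  # hypothetical example
paragraphs = [
    "A cross-encoder jointly encodes a query and a passage to produce a relevance score.",
    "Paris is the capital of France.",
    "Bi-encoders embed query and passage separately for fast retrieval.",
]

# Score every (query, paragraph) pair and rank paragraphs by score.
scores = model.predict([[query, p] for p in paragraphs])
ranking = np.argsort(scores)[::-1]

top1 = paragraphs[ranking[0]]  # a top-1 hit if this is the labelled positive
```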
I do have a decent-sized dataset of questions and positive paragraphs, which I have collected from my application. For these questions I also have several hard-negative paragraphs. I intend to fine-tune cross-encoder/ms-marco-MiniLM-L-6-v2 on this dataset to improve top-1 accuracy. The dataset has several thousand (question, positive paragraph) pairs.
For fine-tuning, my current thought process is to use knowledge distillation on my dataset, identical to the approach provided in this repo based on the Sebastian Hofstätter et al. paper: https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/ms_marco/train_cross-encoder_kd.py. However, doing so would require me to train three large models (BERT-base, BERT-large, ALBERT-large) on my dataset, collect the logits, and then use those logits to fine-tune MiniLM (identical to what is done in train_cross-encoder_kd.py). This would be extremely time-consuming and compute-intensive.
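As a rough illustration of the distillation idea (not necessarily the exact recipe in the linked script): score each training pair with the teacher(s), then fit the student to the teacher scores with an MSE loss. A minimal sketch assuming the classic CrossEncoder API; the single teacher model here stands in for the BERT-base/BERT-large/ALBERT-large ensemble:

```python
import torch.nn as nn
from torch.utils.data import DataLoader
from sentence_transformers import InputExample
from sentence_transformers.cross_encoder import CrossEncoder

# Hypothetical (query, paragraph) training pairs.
pairs = [
    ["what is knowledge distillation?",
     "Distillation trains a small student to mimic a large teacher."],
    ["what is knowledge distillation?",
     "Paris is the capital of France."],
]

# 1) Collect teacher scores (in the paper this is an ensemble of three
#    large cross-encoders; one larger model stands in here).
teacher = CrossEncoder("cross-encoder/ms-marco-electra-base")
teacher_scores = teacher.predict(pairs)

# 2) Train the student to regress the teacher scores with MSE.
train_samples = [
    InputExample(texts=pair, label=float(score))
    for pair, score in zip(pairs, teacher_scores)
]
student = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2", num_labels=1)
train_dataloader = DataLoader(train_samples, shuffle=True, batch_size=16)
student.fit(
    train_dataloader=train_dataloader,
    loss_fct=nn.MSELoss(),
    epochs=1,
    warmup_steps=100,
)
```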