UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

Japanese MS-MARCO model for large scale Asymmetric Search #1785

Open Aniketto16 opened 1 year ago

Aniketto16 commented 1 year ago

Hi everyone!

I wanted to know the exact training procedure/script for training a Japanese bi-encoder for asymmetric search. I am planning to use the translated version of MS MARCO: https://github.com/unicamp-dl/mMARCO

I am fairly new to Sentence Transformers and don't know much about training my own model. From what I know:

  1. I can train the bi-encoder from scratch using https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/ms_marco/train_bi-encoder_mnrl.py. I am trying to use cl-tohoku BERT as the init model, but I am not sure of the exact procedure.
  2. I can distill the model using knowledge distillation, but I cannot find any script for distilling a multilingual model for MS MARCO.

I want to know the exact procedure, and a sample training script would be really helpful too. Please guide me on what my approach should be, and thank you so much for your awesome work out here!
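For option 1, the MNRL script essentially builds (query, positive passage, hard negative) triplets and feeds them to `MultipleNegativesRankingLoss`. A minimal sketch of that data-construction step adapted to the Japanese mMARCO split is below; all ids and texts are toy placeholders, not real mMARCO data.

```python
# Sketch of the (query, positive, negative) triplets that
# train_bi-encoder_mnrl.py feeds to MultipleNegativesRankingLoss,
# adapted to a Japanese corpus. All ids/texts are illustrative.

def build_mnrl_samples(queries, corpus, qrels, hard_negatives):
    """For each query, pair each relevant passage with each mined hard negative."""
    samples = []
    for qid, pos_pids in qrels.items():
        for pos_pid in pos_pids:
            for neg_pid in hard_negatives.get(qid, []):
                samples.append((queries[qid], corpus[pos_pid], corpus[neg_pid]))
    return samples

queries = {"q1": "日本の首都はどこですか"}          # "Where is the capital of Japan?"
corpus = {
    "p1": "日本の首都は東京です。",                  # relevant passage
    "p2": "大阪は日本第二の都市です。",              # hard negative
}
qrels = {"q1": ["p1"]}            # query -> relevant passage ids
hard_negatives = {"q1": ["p2"]}   # query -> mined hard-negative ids

samples = build_mnrl_samples(queries, corpus, qrels, hard_negatives)
print(samples[0])
```

In the real script each triplet becomes an `InputExample(texts=[query, pos, neg])`, and the init model is swapped by passing your Japanese checkpoint as `--model_name`.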

buoi commented 1 year ago

Hi, I'm doing the same. I'm trying to use the https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/ms_marco/train_bi-encoder_margin-mse.py script by changing the queries and corpus to the target mMARCO language, starting from a multilingual pretrained model.
I think MarginMSELoss should work better than MultipleNegativesRankingLoss on a single small GPU, as stated by @nreimers. Do you think this could work by using the same hard negatives and cross-encoder scores from the original MS MARCO script?
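For reference, the MarginMSE objective that script optimizes can be written out directly: the student bi-encoder's score margin between the positive and the hard negative is regressed onto the teacher cross-encoder's margin. A minimal torch sketch with made-up scores:

```python
import torch

# MarginMSE: match the student margin s(q,pos) - s(q,neg) to the
# teacher cross-encoder margin CE(q,pos) - CE(q,neg).
# All numbers below are toy stand-ins for real scores.

teacher_pos = torch.tensor([9.0, 7.5])   # cross-encoder scores for (query, positive)
teacher_neg = torch.tensor([2.0, 4.5])   # cross-encoder scores for (query, hard negative)
teacher_margin = teacher_pos - teacher_neg          # [7.0, 3.0]

student_pos = torch.tensor([6.0, 5.0])   # bi-encoder dot-product scores
student_neg = torch.tensor([1.0, 3.0])
student_margin = student_pos - student_neg          # [5.0, 2.0]

loss = torch.nn.functional.mse_loss(student_margin, teacher_margin)
print(loss.item())  # mean of (5-7)^2 and (2-3)^2 = (4 + 1) / 2 = 2.5
```

Because only the *margin* is distilled, the teacher scores stay meaningful even when the student's absolute score scale differs, which is part of why it works well with small batches.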

Aniketto16 commented 1 year ago

Hello @buoi, I completely agree that a bi-encoder with MarginMSE will work best for this use case.

But I wanted to clarify a few things:

  1. Which model did you use to initialize the base sentence transformer exactly?
  2. I understand that you are planning to use the same cross-encoder scores as generated for English MS MARCO; can you explain how you would expand this to a multilingual dataset?

If you are not busy, could you help me prepare the training script? Thank you so much for the reply; looking forward to working with you.
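On question 2: since mMARCO is a machine translation of MS MARCO that (as far as I understand) keeps the original query and passage ids, the English cross-encoder score files can in principle be joined with the translated texts purely by id. A sketch of that join, with entirely made-up ids, texts, and scores:

```python
# Sketch: reuse English MS MARCO cross-encoder scores for translated text.
# Assumption: the translated corpus keeps the original (qid, pid) ids, as
# mMARCO does, so the English teacher scores can be joined by id.
# All data below is illustrative.

ce_scores = {("q1", "p1"): 9.1, ("q1", "p2"): 1.8}   # English teacher scores

ja_queries = {"q1": "日本の首都はどこですか"}
ja_corpus = {"p1": "日本の首都は東京です。", "p2": "大阪は日本第二の都市です。"}

def labeled_pairs(ce_scores, queries, corpus):
    """Attach teacher scores to translated (query, passage) pairs by id."""
    return [
        (queries[qid], corpus[pid], score)
        for (qid, pid), score in ce_scores.items()
        if qid in queries and pid in corpus
    ]

pairs = labeled_pairs(ce_scores, ja_queries, ja_corpus)
print(pairs)
```

The implicit assumption is that translation preserves relevance, i.e. the English teacher's judgment of a (query, passage) pair still holds for its Japanese translation.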