UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

Make_multilingual msmarco #1076

Open Borg93 opened 3 years ago

Borg93 commented 3 years ago

Is it possible to make this model multilingual with Swedish as the target language? Also, is it possible to distill/make_multilingual any model if you have parallel data for the source and target languages? Does it have to be the same data the model was trained on, or is it enough that it is the same task?

nreimers commented 3 years ago

Hi @Borg93 The model you mention is a generation model, i.e. the input is a paragraph and the output is a question. Creating a multilingual model out of this is not straightforward. You could machine translate a corpus and train a suitable model on it.
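
A rough sketch of what that machine-translation step could look like, using the OPUS-MT model Helsinki-NLP/opus-mt-en-sv; the model choice and the example sentences are illustrative assumptions, not part of the original suggestion:

```python
from transformers import MarianMTModel, MarianTokenizer

# Assumed translation model: OPUS-MT English -> Swedish.
model_name = "Helsinki-NLP/opus-mt-en-sv"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Placeholder English passages standing in for the corpus to translate.
english_passages = [
    "Python is an interpreted, high-level programming language.",
    "The library provides state-of-the-art text embeddings.",
]

batch = tokenizer(english_passages, return_tensors="pt", padding=True, truncation=True)
translated = model.generate(**batch)
swedish_passages = tokenizer.batch_decode(translated, skip_special_tokens=True)
print(swedish_passages)
```

The translated (passage, question) pairs could then be used to fine-tune a generation model for Swedish.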

Borg93 commented 3 years ago

Okay, what about MS MARCO in general, then? Let's say I would like to distill a bi-encoder and a cross-encoder so that they better understand Swedish. Do they have to be trained on the same data (the MS MARCO dataset) to be "converted" into multilingual models?

So this is the overall strategy I was thinking of doing:

  1. I just trained one model, paraphrase-mpnet-base-v2 (teacher model) + xlm-roberta (student model), into Swedish with the TED corpus (as in your script); afterwards I will train it on a gold-standard set of 100k Swedish in-domain duplicate questions, to improve clustering for my domain-specific data (see the distillation sketch after this list).

  2. I also distilled msmarco-distilbert-base-v4 (teacher) with xlm-roberta (student model) into a Swedish model to use as a bi-encoder, and will do the same for cross-encoder/ms-marco-MiniLM-L-12-v2, to use them as an IR pipeline for semantic search. I don't really have any training data for this part to improve the in-domain accuracy of the search. That's why I was thinking of using BeIR/query-gen-msmarco-t5-large-v1 to create some in-domain training data.

  3. I am also thinking about taking parts from the clustering side, such as the bi-encoder, and training a cross-encoder on in-domain data for symmetric semantic search.
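
A minimal sketch of the distillation step in 1 and 2, following the make_multilingual example from this repository; the parallel-data file path, batch size, and training hyperparameters are placeholder assumptions:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, models, losses
from sentence_transformers.datasets import ParallelSentencesDataset

# Teacher: English model whose embedding space should be transferred.
teacher_model = SentenceTransformer("paraphrase-mpnet-base-v2")

# Student: multilingual transformer with mean pooling on top.
word_embedding_model = models.Transformer("xlm-roberta-base", max_seq_length=128)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
student_model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# Parallel data: tab-separated English / Swedish sentence pairs (placeholder path).
train_data = ParallelSentencesDataset(student_model=student_model, teacher_model=teacher_model)
train_data.load_data("parallel-sentences/ted2020-en-sv.tsv.gz")
train_dataloader = DataLoader(train_data, shuffle=True, batch_size=32)

# The student is trained to mimic the teacher's embeddings for both languages (MSE loss).
train_loss = losses.MSELoss(model=student_model)
student_model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=5,
    warmup_steps=1000,
    output_path="output/make-multilingual-en-sv",
)
```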

Does 1,2,3 seem reasonable?

nreimers commented 3 years ago

1) Sounds reasonable. The issue with the TED corpus is that it only contains sentences; if you also encode longer paragraphs, translated paragraphs would be helpful.
2) You could machine translate the MS MARCO corpus and then train an mT5 model on it.
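
For the query-generation idea in step 2, a hedged sketch of generating synthetic in-domain queries with the BeIR T5 model; the passage and the sampling parameters are placeholders, and a Swedish setup would additionally need a machine-translated corpus or an mT5-based generator as suggested above:

```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

# English query generator trained on MS MARCO (assumed model for illustration).
model_name = "BeIR/query-gen-msmarco-t5-large-v1"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

# Placeholder in-domain passage.
passage = (
    "Python is an interpreted, high-level, general-purpose programming language "
    "created by Guido van Rossum and first released in 1991."
)

inputs = tokenizer(passage, return_tensors="pt", truncation=True, max_length=384)
outputs = model.generate(
    **inputs,
    max_length=64,
    do_sample=True,        # sampling yields more diverse synthetic queries
    top_p=0.95,
    num_return_sequences=3,
)

for output in outputs:
    print(tokenizer.decode(output, skip_special_tokens=True))
```

The generated (query, passage) pairs can then serve as training data for the in-domain bi-encoder and cross-encoder.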

nickchomey commented 1 year ago

I just made this comment in another similar issue - it should solve this problem.

Has anyone here tried the newest multilingual Cross Encoder model? It uses multilingual versions of the MiniLM and MSMarco datasets. It doesn't appear to be in the SBert documentation, but I just stumbled upon it while browsing HF. https://huggingface.co/cross-encoder/mmarco-mMiniLMv2-L12-H384-v1
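
For reference, a minimal sketch of how that multilingual cross-encoder could be used for re-ranking; the Swedish query and passages are made-up examples:

```python
from sentence_transformers import CrossEncoder

# Multilingual cross-encoder trained on mMARCO.
model = CrossEncoder("cross-encoder/mmarco-mMiniLMv2-L12-H384-v1", max_length=512)

query = "Hur många människor bor i Stockholm?"
passages = [
    "Stockholm är Sveriges huvudstad och har cirka 980 000 invånare.",
    "Göteborg ligger på Sveriges västkust.",
]

# Higher score = the passage is judged more relevant to the query.
scores = model.predict([(query, passage) for passage in passages])
print(scores)
```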

There isn't any benchmark data, but this paper seems to have used a fairly similar process and shows that these multilingual datasets/models provide very competitive results when compared to monolingual datasets. https://arxiv.org/pdf/2108.13897.pdf