UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0
15.23k stars 2.47k forks

Fine-tuning multilingual bi-encoder with GenQ #1276

Open Matthieu-Tinycoaching opened 2 years ago

Matthieu-Tinycoaching commented 2 years ago

Hi,

1/ Is it possible to fine-tune a multilingual bi-encoder on domain-specific data using unsupervised synthetic query generation? If yes, is the performance comparable to fine-tuning with a supervised method?

2/ I didn't find a multilingual pre-trained T5 model. Could I use a translation algorithm to fine-tune with the English one, then translate back to the native language?

3/ For a multilingual bi-encoder, does this mean I have to fine-tune with GenQ on all languages of interest at the same time?

Thanks!

nreimers commented 2 years ago

Hi,

1) Yes, it is possible. @kwang2049 will soon release an updated version that is a lot better. But gold data will still be the best.

2) Sadly, generation models for other languages are limited, as there is not much good training data. Here you can find translated versions of MS MARCO: https://github.com/unicamp-dl/mMARCO

You could train an mT5 generation model on this. Note: the mT5 models provided there are not for generation, but for re-ranking.

Otherwise you can also use machine translation

3) Yes, this would be the best
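As a rough illustration of point 3 (fine-tuning on all languages at the same time), here is a minimal sketch, assuming you already have generated queries per passage for each language. The helper names `build_training_pairs` and `finetune_bi_encoder`, the input layout, and the default hyperparameters are hypothetical; the training call uses the `MultipleNegativesRankingLoss` setup from sentence-transformers, which is one common choice for (query, passage) pairs and not necessarily what the updated GenQ version will use.

```python
from typing import Dict, List, Tuple


def build_training_pairs(
    generated: Dict[str, List[Tuple[str, List[str]]]],
) -> List[Tuple[str, str]]:
    """Flatten {language: [(passage, [queries])]} into (query, passage) pairs.

    All languages end up in one list, so a single training run sees them together.
    """
    pairs: List[Tuple[str, str]] = []
    for lang in sorted(generated):
        for passage, queries in generated[lang]:
            for query in queries:
                query = query.strip()
                if query:  # skip empty generations
                    pairs.append((query, passage))
    return pairs


def finetune_bi_encoder(
    pairs: List[Tuple[str, str]],
    model_name: str = "paraphrase-multilingual-MiniLM-L12-v2",
    epochs: int = 1,
    batch_size: int = 32,
):
    """Fine-tune a bi-encoder on (query, passage) pairs with in-batch negatives."""
    # Imported lazily so the data helper above works without the library installed.
    from sentence_transformers import InputExample, SentenceTransformer, datasets, losses

    model = SentenceTransformer(model_name)
    examples = [InputExample(texts=[query, passage]) for query, passage in pairs]
    # Avoids duplicate texts inside a batch, which would act as false negatives.
    loader = datasets.NoDuplicatesDataLoader(examples, batch_size=batch_size)
    loss = losses.MultipleNegativesRankingLoss(model)
    model.fit(
        train_objectives=[(loader, loss)],
        epochs=epochs,
        warmup_steps=int(0.1 * len(loader)),
    )
    return model
```

Usage would look like `finetune_bi_encoder(build_training_pairs(generated))`, where `generated` maps a language code to your passages and their synthetic queries.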

Matthieu-Tinycoaching commented 2 years ago

Hi @nreimers, thanks for the feedback!

  1. Great news! Do you mean that there will be a new multilingual model better than paraphrase-multilingual-MiniLM-L12-v2?

  2. How could I train an mT5 generation model based on https://github.com/unicamp-dl/mMARCO? If the mT5 there is trained for re-ranking and not generation, what would be the benefit of such a multilingual model for generating questions from passages?

  3. OK, good.

nreimers commented 2 years ago

1) Yes, hopefully :)

2) They provide a translated version of MS MARCO, which you can use to train an mT5 model.
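To make point 2 more concrete, a minimal sketch of turning translated MS MARCO data into passage-to-query examples for training a generation model. It assumes the data is available as tab-separated `id<TAB>text` rows plus a list of relevant (query id, passage id) pairs; check the mMARCO repository for the actual file layout. The function names and the `google/mt5-small` checkpoint are illustrative choices, not something prescribed in this thread.

```python
import csv
from typing import Dict, Iterable, List, Tuple


def read_tsv(lines: Iterable[str]) -> Dict[str, str]:
    """Parse 'id<TAB>text' rows into an {id: text} dict."""
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    return {row[0]: row[1] for row in reader if len(row) >= 2}


def make_generation_examples(
    passages: Dict[str, str],
    queries: Dict[str, str],
    relevant: Iterable[Tuple[str, str]],
) -> List[Dict[str, str]]:
    """Build seq2seq examples: the passage is the input, the query is the target."""
    examples = []
    for qid, pid in relevant:
        if qid in queries and pid in passages:  # skip ids missing from either file
            examples.append({"input": passages[pid], "target": queries[qid]})
    return examples


def tokenize_for_mt5(
    examples: List[Dict[str, str]],
    model_name: str = "google/mt5-small",
    max_length: int = 320,
):
    """Tokenize inputs and targets; the target ids are returned under 'labels'."""
    # Imported lazily so the helpers above run without transformers installed.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    return tokenizer(
        [example["input"] for example in examples],
        text_target=[example["target"] for example in examples],
        max_length=max_length,
        truncation=True,
    )
```

The tokenized output can then be fed to a standard sequence-to-sequence training loop; once trained, the model generates a query when given a passage in the same language.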

Matthieu-Tinycoaching commented 2 years ago

  1. Would this unsupervised fine-tuning improve the bi-encoder for both semantic search and retrieve/rerank?

nreimers commented 2 years ago

yes

Matthieu-Tinycoaching commented 2 years ago

@nreimers could you give me an estimate on when the new multilingual bi-encoder will come out?

nreimers commented 2 years ago

Hi @Matthieu-Tinycoaching, it will be a focus for Q1 2022. The crawling of large multilingual datasets has made good progress, and I hope it will result in good models.

Matthieu-Tinycoaching commented 2 years ago

Hi @nreimers, good news!