UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

Multilingual multi-qa models #1217

Open · Matthieu-Tinycoaching opened this issue 2 years ago

Matthieu-Tinycoaching commented 2 years ago

Hi,

Would there be any multilingual versions of the multi-qa models or particularly english-french model?

Thanks!

nreimers commented 2 years ago

We are currently working on it. It is a larger project, as we need to mine large quantities of authentic multilingual training data. Further, how best to train the model is unclear. The evaluation is also unclear, as only few resources are available so far. So I expect this will take a bit longer.

You can check multilingual MS MARCO: https://github.com/unicamp-dl/mMARCO

It is a machine-translated version of MS MARCO which can be used for training models.

Matthieu-Tinycoaching commented 2 years ago

Hi @nreimers thanks for helping.

Would it then be possible to use the multilingual MS MARCO models you proposed (https://github.com/unicamp-dl/mMARCO) for asymmetric semantic search or QA?

nreimers commented 2 years ago

Yes
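
For illustration, here is a minimal sketch of asymmetric semantic search with a bi-encoder, using the `util.semantic_search` helper from sentence-transformers. The checkpoint name below is only an illustrative multilingual model, not an mMARCO-trained bi-encoder; swap in whichever multilingual retrieval model you end up using:

```python
from sentence_transformers import SentenceTransformer, util

# Illustrative multilingual checkpoint (assumption); replace with an
# mMARCO-trained bi-encoder for proper asymmetric search.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

# Passages may be in a different language than the query.
passages = [
    "La tour Eiffel se trouve à Paris.",
    "The Great Wall of China is over 13,000 miles long.",
]
passage_embeddings = model.encode(passages, convert_to_tensor=True)

query = "Where is the Eiffel Tower?"
query_embedding = model.encode(query, convert_to_tensor=True)

# Retrieve the top-k passages by cosine similarity.
hits = util.semantic_search(query_embedding, passage_embeddings, top_k=2)[0]
for hit in hits:
    print(round(hit["score"], 3), passages[hit["corpus_id"]])
```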

Matthieu-Tinycoaching commented 2 years ago

OK.

For QA, is it better to apply retrieve & re-rank, or just the multilingual MS MARCO model on its own? If retrieve & re-rank is better, which multilingual cross-encoder could I then use?

Matthieu-Tinycoaching commented 2 years ago

Hi @nreimers, could you give me advice on my last question?

nreimers commented 2 years ago

Retrieve & re-rank is better, but slower. As of now there are not that many multilingual cross-encoders; I only know of these: https://github.com/unicamp-dl/mMARCO
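
To make the trade-off concrete, here is a sketch of the retrieve & re-rank pattern: a fast bi-encoder narrows the corpus to a candidate set, then a cross-encoder re-scores each (query, passage) pair. The bi-encoder name is an illustrative multilingual checkpoint (an assumption, not an mMARCO model); the cross-encoder is the mMARCO model linked in the next comment:

```python
from sentence_transformers import SentenceTransformer, CrossEncoder, util

# Stage 1: fast bi-encoder retrieval (illustrative multilingual checkpoint).
bi_encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

# Stage 2: slower but more accurate cross-encoder re-ranking,
# using the mMARCO cross-encoder mentioned in the next comment.
cross_encoder = CrossEncoder("cross-encoder/mmarco-mMiniLMv2-L12-H384-v1")

passages = [
    "La tour Eiffel se trouve à Paris.",
    "Le Louvre est le musée le plus visité au monde.",
    "The Great Wall of China is over 13,000 miles long.",
]
corpus_embeddings = bi_encoder.encode(passages, convert_to_tensor=True)

query = "Where is the Eiffel Tower?"
query_embedding = bi_encoder.encode(query, convert_to_tensor=True)

# Retrieve candidate passages cheaply with the bi-encoder.
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=3)[0]

# Re-score every (query, passage) candidate pair with the cross-encoder.
pairs = [(query, passages[hit["corpus_id"]]) for hit in hits]
scores = cross_encoder.predict(pairs)

# Sort candidates by cross-encoder score; higher means more relevant.
for hit, score in sorted(zip(hits, scores), key=lambda x: x[1], reverse=True):
    print(round(float(score), 3), passages[hit["corpus_id"]])
```

The bi-encoder keeps retrieval cheap over a large corpus; the cross-encoder reads query and passage jointly, so it is much more accurate but only feasible on the small candidate set.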

nickchomey commented 1 year ago

@Matthieu-Tinycoaching I just made this comment in another similar issue (#1011)

Has anyone here tried the newest multilingual cross-encoder model? It uses a multilingual version of MiniLM trained on the multilingual MS MARCO dataset. It doesn't appear to be in the SBert documentation, but I just stumbled upon it while browsing HF. https://huggingface.co/cross-encoder/mmarco-mMiniLMv2-L12-H384-v1

There isn't any benchmark data, but this paper seems to have used a fairly similar process and shows that these multilingual datasets/models are very competitive with their monolingual counterparts. https://arxiv.org/pdf/2108.13897.pdf
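
For anyone who wants to try it, a minimal usage sketch of that cross-encoder: `CrossEncoder.predict` returns one relevance score per (query, passage) pair, which also works across languages. The example pairs below are my own, just to show the shape of the input:

```python
from sentence_transformers import CrossEncoder

# The multilingual mMARCO cross-encoder mentioned above.
model = CrossEncoder("cross-encoder/mmarco-mMiniLMv2-L12-H384-v1")

# Score English-French (query, passage) pairs; higher = more relevant.
scores = model.predict([
    ("Where is the Eiffel Tower?", "La tour Eiffel se trouve à Paris."),
    ("Where is the Eiffel Tower?", "Le fromage est fabriqué à partir de lait."),
])
print(scores)
```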