UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

Cross_encoder for Multi-Lingual Models #780

Open wenjun90 opened 3 years ago

wenjun90 commented 3 years ago

Hi Nils Reimers, thank you for your great work. Can the Cross-Encoder be applied to multiple languages? Specifically, I want to compute text similarity for French. Will it work well for that? Thank you very much!

nreimers commented 3 years ago

So far these are only trained on English and will not achieve good results for other languages.

If you have training data, training these for other languages is easy.

Note that you can use the multilingual bi-encoder for your task.
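A minimal sketch of the bi-encoder route, assuming the `sentence-transformers` package is installed and that the `distiluse-base-multilingual-cased` model can be downloaded on first use. The helper names `cosine` and `french_similarity` are made up for illustration and are not part of the library.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (lists or numpy rows)."""
    dot = sum(float(a) * float(b) for a, b in zip(u, v))
    norm_u = math.sqrt(sum(float(a) ** 2 for a in u))
    norm_v = math.sqrt(sum(float(b) ** 2 for b in v))
    return dot / (norm_u * norm_v)


def french_similarity(sentence_a, sentence_b,
                      model_name="distiluse-base-multilingual-cased"):
    """Embed two sentences with a multilingual bi-encoder and compare them.

    Hypothetical helper: the model name is an assumption and is fetched
    from the Hugging Face hub the first time it is used.
    """
    # Imported lazily so cosine() can be used without the library installed.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    # encode() on a list returns one embedding row per input sentence.
    emb_a, emb_b = model.encode([sentence_a, sentence_b])
    return cosine(emb_a, emb_b)
```

Calling `french_similarity("Le chat dort sur le canapé.", "Un chat fait la sieste sur le sofa.")` should return a score between -1 and 1, with higher values meaning the sentences are closer in meaning. Each sentence is embedded independently, so the embeddings can be cached and reused across comparisons.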

wenjun90 commented 3 years ago

Hi @nreimers, thank you for your reply. I want to calculate the similarity between a short sentence and a paragraph of about 2000 word pieces. Will the multilingual model (distiluse-base-multilingual-cased) still work well for sentence embedding in this case?

Thank you very much again!

nickchomey commented 1 year ago

I just made this comment in another similar issue - it should solve this problem.

Has anyone here tried the newest multilingual Cross-Encoder model? It uses a multilingual version of the MiniLM model, trained on a multilingual version of the MS MARCO dataset. It doesn't appear to be in the SBERT documentation, but I just stumbled upon it while browsing HF. https://huggingface.co/cross-encoder/mmarco-mMiniLMv2-L12-H384-v1

There isn't any benchmark data, but this paper seems to have used a fairly similar process and shows that these multilingual datasets/models provide very competitive results when compared to monolingual datasets. https://arxiv.org/pdf/2108.13897.pdf
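For anyone who wants to try it, here is a rough sketch of how the model linked above could be used to score query/passage pairs. It assumes `sentence-transformers` is installed and that the checkpoint loads through the library's `CrossEncoder` class. The helpers `sort_by_score` and `rerank` are hypothetical names for illustration, and I have not benchmarked the scores.

```python
def sort_by_score(passages, scores):
    """Pair each passage with its score and sort from most to least relevant."""
    pairs = zip(passages, (float(s) for s in scores))
    return sorted(pairs, key=lambda item: item[1], reverse=True)


def rerank(query, passages,
           model_name="cross-encoder/mmarco-mMiniLMv2-L12-H384-v1"):
    """Score every (query, passage) pair with a cross-encoder and rank them.

    Hypothetical helper: the model is downloaded from the Hugging Face hub
    on first use.
    """
    # Imported lazily so sort_by_score() works without the library installed.
    from sentence_transformers import CrossEncoder

    model = CrossEncoder(model_name)
    # predict() takes a list of text pairs and returns one score per pair.
    scores = model.predict([(query, passage) for passage in passages])
    return sort_by_score(passages, scores)
```

Unlike the bi-encoder, the cross-encoder reads the query and the passage together, so nothing can be precomputed. A common pattern is to retrieve candidates with a bi-encoder first and only rerank the top results this way.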