Model for working with multi-languages

UKPLab / sentence-transformers

State-of-the-Art Text Embeddings

https://www.sbert.net

Apache License 2.0

15.19k stars 2.47k forks

Model for working with multi-languages #1232

Open danabez1 opened 3 years ago

danabez1 commented 3 years ago

Hello, I am working with data that contains a mix of languages (e.g. English, French, Spanish). I would like to use a pre-trained Sentence-BERT model to find similarities between pairs of texts. In some cases, text1 and text2 are in the same language; in others, they are in two different languages.

Which model do you recommend I work with?
Is 'all-mpnet-base-v2' suitable for my case? Does the prefix 'all' imply that it is suitable for multiple languages as well?
What is the advantage of using multilingual models like 'paraphrase-multilingual-mpnet-base-v2'? Thank you.

nreimers commented 3 years ago

The all-* models are trained only on English. The 'all' means 'all-training-datasets'.

You must use one of the multilingual models, like paraphrase-multilingual-mpnet-base-v2.
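A minimal sketch of scoring text pairs with that model, assuming the `sentence-transformers` package is installed and the weights can be downloaded. The helper names (`cosine`, `pair_similarities`, `load_encoder`) are illustrative and not part of the library; only `SentenceTransformer(...)` and its `encode` method are library calls.

```python
import math

# Model named in the reply above.
MODEL_NAME = "paraphrase-multilingual-mpnet-base-v2"


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pair_similarities(pairs, encode):
    """Score each (text1, text2) pair; `encode` maps a list of texts to vectors."""
    left = encode([a for a, _ in pairs])
    right = encode([b for _, b in pairs])
    return [cosine(u, v) for u, v in zip(left, right)]


def load_encoder(model_name=MODEL_NAME):
    """Return the `encode` method of a sentence-transformers model.

    The import is done lazily so the helpers above can be used (and tested)
    without the library installed. Weights are downloaded on first use.
    """
    from sentence_transformers import SentenceTransformer

    return SentenceTransformer(model_name).encode
```

Calling `pair_similarities([("The weather is nice today.", "Il fait beau aujourd'hui.")], load_encoder())` would score one English/French pair; same-language pairs go through the same call. The library also ships its own helper, `sentence_transformers.util.cos_sim`, which can replace the hand-written `cosine` here.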

danabez1 commented 3 years ago

Thank you :)

Tortoise17 commented 2 years ago

@nreimers I have one question. For example, if the texts are stored in a dataframe in English and the query is in German or French, will the multilingual model still return the matching answer for that query, or is that not possible for this kind of problem?

nreimers commented 2 years ago

Languages are aligned, so it will find the closest match irrespective of the language.
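A rough sketch of that cross-lingual lookup, assuming an English corpus, a query in another language, and the `paraphrase-multilingual-mpnet-base-v2` model mentioned earlier in the thread. The function names (`cosine`, `search`, `load_encoder`) are hypothetical helpers; only `SentenceTransformer(...)` and `encode` come from the library.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def search(query, corpus, encode, top_k=3):
    """Return the top_k (score, text) pairs from corpus, best match first.

    `encode` maps a list of texts to a list of vectors. The query and the
    corpus may be in different languages; ranking uses embeddings only.
    """
    corpus_vecs = encode(corpus)
    query_vec = encode([query])[0]
    scored = [(cosine(query_vec, vec), text) for text, vec in zip(corpus, corpus_vecs)]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_k]


def load_encoder(model_name="paraphrase-multilingual-mpnet-base-v2"):
    """Return the `encode` method of a sentence-transformers model.

    Imported lazily; weights are downloaded on first use.
    """
    from sentence_transformers import SentenceTransformer

    return SentenceTransformer(model_name).encode
```

For example, `search("Wie ist das Wetter heute?", english_texts, load_encoder())` would rank the entries of a hypothetical `english_texts` list against a German query. The matches come back in the language they were stored in; nothing is translated. For larger corpora, the library's `sentence_transformers.util.semantic_search` is likely a better fit than this linear scan.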

Hayat21 commented 1 year ago

Hello there, I hope you are doing well. Can you tell me which model supports the Arabic language? It must be fast and perform well on semantic tasks. Thanks

jaideep11061982 commented 1 year ago

@nreimers @danabez1 How many languages does paraphrase_ml support? Does it support 100 languages or 15 languages?