Get sentence embedding of other language

UKPLab / sentence-transformers

State-of-the-Art Text Embeddings

https://www.sbert.net

Apache License 2.0


Get sentence embedding of other language #369

Open orenpapers opened 4 years ago

orenpapers commented 4 years ago

Hello, how can I use the package to get embeddings for a language other than English (e.g. Hebrew)? I couldn't find any code example showing how to load the model. Thanks.

nreimers commented 4 years ago

You just need to load one of the multi-lingual models: https://github.com/UKPLab/sentence-transformers#multilingual-models

Then you can use it as shown in the examples. Just input Hebrew instead of English.

orenpapers commented 4 years ago

@nreimers I understand, but can you please give an example of the proper syntax? I tried several variations but none worked, for example:

from sentence_transformers import SentenceTransformer
model = SentenceTransformer('xlm-r-he-bert-base-nli-stsb-mean-tokens')

Got: HTTPError: 404 Client Error: Not Found for url: https://public.ukp.informatik.tu-darmstadt.de/reimers/sentence-transformers/v0.2/xlm-r-he-bert-base-nli-stsb-mean-tokens.zip

nreimers commented 4 years ago

There is only one model, called xlm-r-100langs-bert-base-nli-mean-tokens, and it can process text from 100 languages. You don't need to specify your input language. Just input your text and you get an embedding, independent of the language. You can use the same model for all the listed languages, and sentences with similar meaning will be close to each other in the embedding space.
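For anyone landing here later, a minimal sketch of what that looks like in code. The model name is the one quoted above; whether it still downloads under that exact name in current releases is an assumption. The helper names (`load_model`, `cosine_similarity`, `similarities_to_first`, `demo`) and the example sentences are made up for illustration and are not part of the library.

```python
import math

# Model name as quoted in this thread; availability in current releases is assumed.
MODEL_NAME = "xlm-r-100langs-bert-base-nli-mean-tokens"


def load_model(name=MODEL_NAME):
    """Load the multilingual model (downloads it on first use)."""
    # Imported lazily so the helpers below work without the package installed.
    from sentence_transformers import SentenceTransformer
    return SentenceTransformer(name)


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(float(x) * float(y) for x, y in zip(a, b))
    norm_a = math.sqrt(sum(float(x) ** 2 for x in a))
    norm_b = math.sqrt(sum(float(y) ** 2 for y in b))
    return dot / (norm_a * norm_b)


def similarities_to_first(model, sentences):
    """Embed all sentences and compare each one to the first."""
    embeddings = model.encode(sentences)  # one vector per input sentence
    return [cosine_similarity(embeddings[0], e) for e in embeddings[1:]]


def demo():
    """Hypothetical usage: Hebrew and English go through the same call."""
    model = load_model()
    sentences = [
        "אני אוהב לקרוא ספרים",       # Hebrew: "I love reading books"
        "I love reading books",       # English sentence with the same meaning
        "The weather is cold today",  # unrelated English sentence
    ]
    for sentence, score in zip(sentences[1:], similarities_to_first(model, sentences)):
        print(f"{score:.3f}  {sentence}")
```

Calling `demo()` should print a higher score for the matching English sentence than for the unrelated one if the model behaves as described above, but the exact numbers depend on the model version.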

orenpapers commented 4 years ago

@nreimers How does it identify the language? Please note that Hebrew is RTL, so should I format the input differently?

nreimers commented 4 years ago

It does not need to know the language. No changes are needed for RTL languages.
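A small illustration of why the RTL point is a non-issue on the Python side: Unicode strings are stored in logical (reading) order, and right-to-left layout is applied only when the text is displayed. The sketch below uses only the standard library; `describe_first_char` is a hypothetical helper name, and nothing here is specific to sentence-transformers.

```python
import unicodedata


def describe_first_char(text):
    """Return the Unicode name and bidirectional class of the first stored character."""
    first = text[0]
    return unicodedata.name(first), unicodedata.bidirectional(first)


hebrew = "שלום עולם"  # "Hello world", typed normally with no reversal

# Index 0 is the first letter a Hebrew reader reads (shin), even though
# it is drawn at the right edge when the string is rendered.
name, bidi_class = describe_first_char(hebrew)
print(name)        # HEBREW LETTER SHIN
print(bidi_class)  # R (a right-to-left character)
```

So the string can be passed to the model exactly as typed or read from a UTF-8 file; reversing it beforehand would scramble the text.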