UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0
14.78k stars · 2.43k forks

Multilingual models are uncased? #729

Open tide90 opened 3 years ago

tide90 commented 3 years ago

A short question where I cannot find the answer: do the multilingual models use uncased data? Like distilbert-multilingual-nli-stsb-quora-ranking. I think German is also among the training languages (the languages used are actually not listed). Is German text then uncased, e.g.?

Actually, this question applies to all multilingual models, like paraphrase-xlm-r-multilingual, too...

nreimers commented 3 years ago

The multilingual models are cased

tide90 commented 3 years ago

Thanks! I actually assumed they would be uncased, as I thought English models are mostly uncased?

tide90 commented 3 years ago

@nreimers What is the reason that the multilingual models are cased? Isn't it better to have an uncased model?
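For context on what cased vs. uncased means here: an uncased model lowercases its input before tokenization, so surface-case distinctions are lost. This matters especially for German, where all nouns are capitalized and case can disambiguate words. A toy sketch (plain Python, no model download; the `tokenize` function and its whitespace splitting are invented purely for illustration, not the library's actual tokenizer):

```python
# Toy illustration of cased vs. uncased preprocessing.
# In German, "essen" (verb: to eat) and "Essen" (noun: food, or the city)
# differ only by capitalization; an uncased pipeline collapses them.

def tokenize(text: str, cased: bool) -> list[str]:
    """Hypothetical tokenizer: whitespace split, optionally case-folded."""
    if not cased:
        text = text.lower()  # uncased models normalize case away first
    return text.split()

cased_tokens = tokenize("Wir essen in Essen", cased=True)
uncased_tokens = tokenize("Wir essen in Essen", cased=False)

print(cased_tokens)    # ['Wir', 'essen', 'in', 'Essen']
print(uncased_tokens)  # ['wir', 'essen', 'in', 'essen']
# The cased variant keeps the verb/noun distinction; the uncased one loses it.
```

So a cased model can, in principle, use capitalization as a signal, which is one plausible reason to prefer it for languages like German.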