UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

distiluse-base-multilingual-cased giving better accuracy #367

Open ankitkr3 opened 4 years ago

ankitkr3 commented 4 years ago

Hi @nreimers, I was testing English sentence similarity using other BERT and RoBERTa models, but unexpectedly distiluse-base-multilingual-cased is giving me better accuracy. Can you please explain why this is happening?

nreimers commented 4 years ago

Hi @ankitkr3 On which type of data are you testing it?

DistilUSE is a distillation of the Universal Sentence Encoder, which was trained (among other data) on large-scale question-response pairs from various internet communities like Reddit, StackOverflow etc. It works quite well for noisier data or for more domain-specific data.

The BERT models were trained on NLI + STSb data, which are cleaner and cover a narrower range of topics. These models work better for general sentences (without noise and without a narrow domain).

Best Nils Reimers

ankitkr3 commented 4 years ago

@nreimers I am using it to find similarity between paragraphs; consider them student answers, which can vary in length.