UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

Question: How did you train `xlm-r-100langs-bert-base-nli-stsb-mean-tokens`? #469

Closed PhilipMay closed 3 years ago

PhilipMay commented 3 years ago

Hi, as you might know, I open-sourced a German translation of the STSb dataset: https://github.com/t-systems-on-site-services-gmbh/german-STSbenchmark

I tested xlm-r-100langs-bert-base-nli-stsb-mean-tokens on the test set of STSb and it performs surprisingly well. My question: did you train it on the full (English) dataset (including the test set) and then train it multilingually? That would explain why it performs so well...

I do not know why it performs so well and would like to understand.
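
For reference, the evaluation I ran can be reproduced roughly as in the sketch below with the library's `EmbeddingSimilarityEvaluator`. The file name `stsb_de_test.csv` and its column names are assumptions; adapt them to the actual German STSbenchmark release.

```python
# Minimal evaluation sketch, assuming a CSV "stsb_de_test.csv" with columns
# "sentence1", "sentence2" and "score" (gold scores from 0 to 5).
import csv

from sentence_transformers import InputExample, SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

model = SentenceTransformer("xlm-r-100langs-bert-base-nli-stsb-mean-tokens")

test_samples = []
with open("stsb_de_test.csv", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        # Normalize the 0-5 gold scores to [0, 1] for comparison with cosine similarity.
        test_samples.append(
            InputExample(texts=[row["sentence1"], row["sentence2"]], label=float(row["score"]) / 5.0)
        )

evaluator = EmbeddingSimilarityEvaluator.from_input_examples(test_samples, name="stsb-de-test")
print(evaluator(model))  # correlation between cosine similarity and the gold scores
```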

Thanks Philip

nreimers commented 3 years ago

Hi Philip, it uses bert-base-nli-stsb-mean-tokens as the teacher and XLM-R as the student model (https://arxiv.org/abs/2004.09813).
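
The multilingual knowledge distillation from that paper can be set up with the library roughly as in the sketch below, following the multilingual training example. The parallel-sentences file name is an assumption; any tab-separated file of English sentences and their translations works.

```python
# Teacher-student (multilingual knowledge distillation) sketch.
from torch.utils.data import DataLoader

from sentence_transformers import SentenceTransformer, losses, models
from sentence_transformers.datasets import ParallelSentencesDataset

# Teacher: the monolingual English model whose embedding space the student should imitate.
teacher_model = SentenceTransformer("bert-base-nli-stsb-mean-tokens")

# Student: XLM-R with mean pooling, so it accepts text in ~100 languages.
word_embedding_model = models.Transformer("xlm-roberta-base")
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
student_model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# Parallel data: the student is trained so that its embeddings of both the English
# sentence and its translation match the teacher's embedding of the English sentence.
train_data = ParallelSentencesDataset(student_model=student_model, teacher_model=teacher_model)
train_data.load_data("parallel-sentences-en-de.tsv.gz")  # assumed file name

train_dataloader = DataLoader(train_data, shuffle=True, batch_size=32)
train_loss = losses.MSELoss(model=student_model)

student_model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=1000,
)
```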

The BERT model was trained on SNLI+MultiNLI and on the STSb train set (tuned on the STSb dev set). No data from the STSb test set was used.

Pre-training with NLI data makes quite a big difference for STS.
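
A rough sketch of that two-stage teacher training is below: NLI with a softmax classification loss, then the STSb train split with a cosine similarity loss. The in-line examples are placeholders standing in for the real SNLI/MultiNLI and STSb data loaders, not the exact scripts used for the released model.

```python
# Two-stage teacher training sketch: NLI classification, then STSb regression.
from torch.utils.data import DataLoader

from sentence_transformers import InputExample, SentenceTransformer, losses, models

# Teacher architecture: BERT with mean pooling over token embeddings.
word_embedding_model = models.Transformer("bert-base-uncased")
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# Stage 1: SNLI + MultiNLI, 3-way classification (entailment / neutral / contradiction).
nli_examples = [InputExample(texts=["A man is eating.", "A man eats something."], label=0)]
nli_dataloader = DataLoader(nli_examples, shuffle=True, batch_size=16)
nli_loss = losses.SoftmaxLoss(
    model=model,
    sentence_embedding_dimension=model.get_sentence_embedding_dimension(),
    num_labels=3,
)
model.fit(train_objectives=[(nli_dataloader, nli_loss)], epochs=1, warmup_steps=100)

# Stage 2: STSb train split, regression on gold similarity scores scaled to [0, 1].
sts_examples = [InputExample(texts=["A plane takes off.", "An airplane is taking off."], label=1.0)]
sts_dataloader = DataLoader(sts_examples, shuffle=True, batch_size=16)
sts_loss = losses.CosineSimilarityLoss(model=model)
model.fit(train_objectives=[(sts_dataloader, sts_loss)], epochs=4, warmup_steps=100)
```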

PhilipMay commented 3 years ago

Ok - thanks.