UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

Why does student_model (paraphrase-xlm-r-multilingual-v1) perform much better than teacher_model (paraphrase-distilroberta-base-v1) on an English dataset similar to STS? #1172

Open Lier007 opened 3 years ago

Lier007 commented 3 years ago

Thank you for posting such an excellent repo; we have learned a lot from it. I have a custom English dataset similar to STS. When I evaluate it with paraphrase-xlm-r-multilingual-v1, the result (Spearman = 0.756) is unexpectedly good. But when I evaluate it with paraphrase-distilroberta-base-v1 (the teacher), the result (Spearman = 0.65) drops by about 10 points. I don't understand why this happens.
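For reference, here is a minimal sketch of how such a comparison can be measured, assuming the custom dataset is a three-column TSV (sentence1, sentence2, gold score) like STS; the file name and layout are illustrative, not from the original report.

```python
import csv
from scipy.stats import spearmanr
from sentence_transformers import SentenceTransformer, util

sents1, sents2, gold = [], [], []
with open("custom_sts_test.tsv", encoding="utf-8") as f:  # hypothetical file name
    for s1, s2, score in csv.reader(f, delimiter="\t"):
        sents1.append(s1)
        sents2.append(s2)
        gold.append(float(score))

for name in ["paraphrase-xlm-r-multilingual-v1", "paraphrase-distilroberta-base-v1"]:
    model = SentenceTransformer(name)
    emb1 = model.encode(sents1, convert_to_tensor=True)
    emb2 = model.encode(sents2, convert_to_tensor=True)
    # Cosine similarity of each aligned pair, then Spearman against the gold scores
    cos = util.cos_sim(emb1, emb2).diagonal().cpu().numpy()
    print(name, "Spearman:", spearmanr(gold, cos).correlation)
```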

Lier007 commented 3 years ago

I also found that the multilingual model (paraphrase-xlm-r-multilingual-v1) is very difficult to distill. When I run 10 different seeds, only one of them converges to a good Spearman score; the remaining 9 do not converge at all, even though the MSE loss drops to the same level in every run.
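A minimal sketch of the kind of seed experiment described above, in the style of the repo's multilingual distillation examples (MSELoss on teacher embeddings); the student checkpoint, training corpus, and hyperparameters are placeholders, not the commenter's actual setup.

```python
import random
import numpy as np
import torch
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

def set_seed(seed):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)

teacher = SentenceTransformer("paraphrase-xlm-r-multilingual-v1")
train_sentences = ["an example sentence", "another example sentence"]  # illustrative corpus
teacher_embeddings = teacher.encode(train_sentences)

for seed in range(10):
    set_seed(seed)
    # Re-initialize the student for every seed (checkpoint name is hypothetical)
    student = SentenceTransformer("some-smaller-student-checkpoint")
    examples = [InputExample(texts=[s], label=emb)
                for s, emb in zip(train_sentences, teacher_embeddings)]
    loader = DataLoader(examples, shuffle=True, batch_size=64)
    loss = losses.MSELoss(model=student)  # student learns to reproduce teacher embeddings
    student.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=100)
    # ... evaluate Spearman on the dev set here and compare convergence across seeds
```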

nreimers commented 3 years ago

Sounds weird. Are you sure you have a good test set and measure everything correctly?

Lier007 commented 3 years ago

> Are you sure you have a good test set and measure everything correctly?

Yes, I am pretty sure. Because the result was so strange, I manually looked at dozens of samples with large differences in similarity. It turns out that the student_model is unexpectedly better.
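One way to pull out such samples for manual inspection is to rank pairs by how differently the two models score them; the snippet below is only an illustration of that idea, with placeholder data rather than the commenter's dataset.

```python
import numpy as np
from sentence_transformers import SentenceTransformer, util

def pair_similarities(model_name, sents1, sents2):
    model = SentenceTransformer(model_name)
    e1 = model.encode(sents1, convert_to_tensor=True)
    e2 = model.encode(sents2, convert_to_tensor=True)
    return util.cos_sim(e1, e2).diagonal().cpu().numpy()

# Placeholders standing in for the real sentence pairs and gold scores
sents1, sents2, gold = ["placeholder"], ["placeholder"], [0.5]

student_sim = pair_similarities("paraphrase-xlm-r-multilingual-v1", sents1, sents2)
teacher_sim = pair_similarities("paraphrase-distilroberta-base-v1", sents1, sents2)

# Show the pairs on which the two models disagree the most
for i in np.argsort(-np.abs(student_sim - teacher_sim))[:30]:
    print(f"gold={gold[i]:.2f}  student={student_sim[i]:.2f}  teacher={teacher_sim[i]:.2f}")
    print("  ", sents1[i], "|", sents2[i])
```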

Lier007 commented 3 years ago

By setting a threshold and using paraphrase-xlm-r-multilingual-v1 for Chinese paraphrase classification, the accuracy is even comparable to that of humans without domain expertise. I was shocked, so I want to keep building on this incredible model. But then I found that distilling from it is much harder than distilling from the teacher_model.
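The threshold-based classification can be sketched as below; the sentence pairs and the threshold value are made-up examples, not the actual experiment.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-xlm-r-multilingual-v1")

# Illustrative pairs and labels; 1 = paraphrase, 0 = not a paraphrase
pairs = [("今天天气很好", "今天天气不错"), ("我喜欢猫", "他在看电影")]
labels = [1, 0]
threshold = 0.7  # hypothetical value; would need tuning on held-out data

emb1 = model.encode([p[0] for p in pairs], convert_to_tensor=True)
emb2 = model.encode([p[1] for p in pairs], convert_to_tensor=True)
cos = util.cos_sim(emb1, emb2).diagonal()  # similarity of each aligned pair

preds = (cos > threshold).long().tolist()
accuracy = sum(int(p == y) for p, y in zip(preds, labels)) / len(labels)
print(f"accuracy = {accuracy:.3f}")
```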

Punchwes commented 3 years ago

> Sounds weird. Are you sure you have a good test set and measure everything correctly?

I found something similar, but in my case the scenario is fine-tuning SBERT directly on downstream tasks. It turns out that for some tasks (e.g. MRPC), DistilBERT significantly outperforms BERT within the SBERT structure.
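For context, fine-tuning the SBERT bi-encoder structure on a pair-classification task like MRPC with either backbone can be sketched roughly as follows; the training pairs and hyperparameters here are illustrative, not the setup behind the numbers above.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, models

def build_sbert(backbone):
    word_emb = models.Transformer(backbone, max_seq_length=128)
    pooling = models.Pooling(word_emb.get_word_embedding_dimension())
    return SentenceTransformer(modules=[word_emb, pooling])

# MRPC-style pairs: (sentence1, sentence2, label) with label in {0, 1}; illustrative only
train_pairs = [("He said hello.", "He greeted them.", 1),
               ("The cat slept.", "Stocks fell sharply.", 0)]
examples = [InputExample(texts=[s1, s2], label=y) for s1, s2, y in train_pairs]
loader = DataLoader(examples, shuffle=True, batch_size=16)

for backbone in ["bert-base-uncased", "distilbert-base-uncased"]:
    model = build_sbert(backbone)
    loss = losses.SoftmaxLoss(model=model,
                              sentence_embedding_dimension=model.get_sentence_embedding_dimension(),
                              num_labels=2)
    model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=100)
    # ... evaluate each fine-tuned model on the MRPC dev set here
```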