Lier007 opened 3 years ago
I also found that the multilingual model (paraphrase-xlm-r-multilingual-v1) is very difficult to distill. When I ran 10 different seeds, only one of them converged to a good Spearman result; the remaining 9 did not converge at all, even though the MSE loss dropped to the same level in every run.
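For context, the setup follows the standard sentence-transformers multilingual distillation recipe, roughly like this (a minimal sketch; the student backbone, data file, and hyperparameters are placeholders, not my exact config):

```python
import random

import numpy as np
import torch
from torch.utils.data import DataLoader

from sentence_transformers import SentenceTransformer, models, losses
from sentence_transformers.datasets import ParallelSentencesDataset

def set_seed(seed: int):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

set_seed(0)  # the only thing that changes between the 10 runs

# Teacher: the model that is hard to distill
teacher = SentenceTransformer("paraphrase-xlm-r-multilingual-v1")

# Student: placeholder backbone with mean pooling
word_emb = models.Transformer("distilbert-base-multilingual-cased", max_seq_length=128)
pooling = models.Pooling(word_emb.get_word_embedding_dimension())
student = SentenceTransformer(modules=[word_emb, pooling])

# The student is trained to reproduce the teacher's embeddings (MSE objective)
train_data = ParallelSentencesDataset(student_model=student, teacher_model=teacher)
train_data.load_data("train-sentences.tsv")  # placeholder path; tab-separated sentences per line
loader = DataLoader(train_data, shuffle=True, batch_size=64)
train_loss = losses.MSELoss(model=student)

student.fit(train_objectives=[(loader, train_loss)], epochs=5, warmup_steps=1000)
```

Only the seed changes between runs; everything else is identical.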
Sounds weird. Are you sure you have a good test set and measure everything correctly?
> Are you sure you have a good test set and measure everything correctly?
Yes, I am pretty sure. Because the result was so strange, I manually inspected dozens of samples with large differences in similarity, and it turned out that the student model was unexpectedly better.
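Concretely, I scored every pair with both models and hand-checked the pairs where they disagree the most, roughly like this (a sketch; the pair list is a placeholder for my dataset):

```python
from sentence_transformers import SentenceTransformer, util

teacher = SentenceTransformer("paraphrase-distilroberta-base-v1")
student = SentenceTransformer("paraphrase-xlm-r-multilingual-v1")

# Placeholder for the real (sentence_a, sentence_b) pairs from the dataset
pairs = [("...", "...")]

def pair_scores(model, pairs):
    emb_a = model.encode([a for a, _ in pairs], convert_to_tensor=True)
    emb_b = model.encode([b for _, b in pairs], convert_to_tensor=True)
    # Cosine similarity of each pair (diagonal of the pairwise matrix)
    return util.cos_sim(emb_a, emb_b).diagonal().tolist()

t_scores = pair_scores(teacher, pairs)
s_scores = pair_scores(student, pairs)

# Sort by disagreement and inspect the top pairs manually
by_gap = sorted(zip(pairs, t_scores, s_scores), key=lambda x: abs(x[1] - x[2]), reverse=True)
for (a, b), t, s in by_gap[:30]:
    print(f"teacher={t:.3f} student={s:.3f}  {a} | {b}")
```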
By setting a threshold and using paraphrase-xlm-r-multilingual-v1 for Chinese paraphrase classification, the accuracy is even comparable to a human without domain expertise. I was shocked, so I want to keep building on this remarkable model. But I then found that distilling it is much harder than distilling the teacher model.
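The thresholding I mean is nothing more than cosine similarity with a cutoff (a sketch; the 0.8 value is made up and should be tuned on labeled pairs):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-xlm-r-multilingual-v1")

def is_paraphrase(sent_a: str, sent_b: str, threshold: float = 0.8) -> bool:
    # threshold is a hypothetical value; choose it on a held-out labeled set
    emb = model.encode([sent_a, sent_b], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item() >= threshold

print(is_paraphrase("今天天气很好", "今天的天气真不错"))
```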
> Sounds weird. Are you sure you have a good test set and measure everything correctly?
I found something similar, though my scenario was directly fine-tuning SBERT on downstream tasks: for some tasks (e.g., MRPC), DistilBERT significantly outperforms BERT in the SBERT setup.
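For what it's worth, the comparison looks roughly like this (a sketch with toy pairs; swap in the real MRPC data; the hyperparameters are placeholders):

```python
from torch.utils.data import DataLoader

from sentence_transformers import SentenceTransformer, models, losses, InputExample

def build_sbert(backbone: str) -> SentenceTransformer:
    word_emb = models.Transformer(backbone, max_seq_length=128)
    pooling = models.Pooling(word_emb.get_word_embedding_dimension())
    return SentenceTransformer(modules=[word_emb, pooling])

# Toy MRPC-style pairs: (sentence1, sentence2, label in {0, 1})
train_examples = [
    InputExample(texts=["He said the food was good.", "The food was delicious, he said."], label=1),
    InputExample(texts=["The stock rose sharply.", "Rain is expected tomorrow."], label=0),
]

for backbone in ["bert-base-uncased", "distilbert-base-uncased"]:
    model = build_sbert(backbone)
    loader = DataLoader(train_examples, shuffle=True, batch_size=16)
    # Pair classification head on top of the two sentence embeddings
    loss = losses.SoftmaxLoss(
        model=model,
        sentence_embedding_dimension=model.get_sentence_embedding_dimension(),
        num_labels=2,
    )
    model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=100)
```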
Thank you for posting such an excellent repo; we have learned a lot from it. I have a custom English dataset similar to STS. When I evaluate it with paraphrase-xlm-r-multilingual-v1, the result (Spearman = 0.756) is unexpectedly good, but when I evaluate with paraphrase-distilroberta-base-v1 (the teacher), the result (Spearman = 0.65) is about 10 points lower. I don't know why this happens.
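Both numbers come from the same evaluation, along these lines (a sketch; the sentences and gold scores here are placeholders for my dataset):

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

# Placeholder STS-style data: pairs with gold similarity scores scaled to [0, 1]
sentences1 = ["A man is playing a guitar.", "The weather is nice today."]
sentences2 = ["Someone plays an instrument.", "It is raining heavily."]
gold_scores = [0.8, 0.1]

evaluator = EmbeddingSimilarityEvaluator(sentences1, sentences2, gold_scores, name="custom-sts")

for model_name in ["paraphrase-xlm-r-multilingual-v1", "paraphrase-distilroberta-base-v1"]:
    model = SentenceTransformer(model_name)
    # Older sentence-transformers versions return the Spearman (cosine) score as a float;
    # newer ones return a dict of metrics
    score = evaluator(model)
    print(model_name, score)
```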