Closed: MichelBartels closed this issue 2 years ago
@MichelBartels Can you please post the results here and close the issue?
Yes, here are the results:

- Distillation with BERT-base as teacher: EM 71.6%, F1 76.2%
- Distillation with BERT-large as teacher: EM 71.2%, F1 76.4%

Given these results, using large teachers does not seem promising. On the upside, the distillation process seems to be very stable, as the prediction quality is nearly identical across teachers.
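For context, the comparison above concerns soft-label distillation, where the student is trained to match the teacher's output distribution. A minimal sketch of that objective in PyTorch (the toy logits and function name are illustrative, not the actual Haystack training code):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between temperature-softened student and teacher
    distributions, scaled by T^2 as in Hinton et al. (2015)."""
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2

# Toy batch of 2 examples with 4 classes each (fixed, illustrative logits)
student = torch.tensor([[1.0, 0.5, -0.5, 0.0], [0.2, 0.1, 0.0, -0.1]])
teacher = torch.tensor([[1.2, 0.4, -0.6, 0.1], [0.3, 0.0, 0.1, -0.2]])
loss = distillation_loss(student, teacher)
```

Because only the teacher's soft targets change between the base and large runs, very similar EM/F1 for both suggests the objective itself is robust to the choice of teacher.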
As a next step towards distilling better language models, we want to explore the difference between distilling from a base model and distilling from a large model.
For this, we would need to decide on a dataset. We also need to consider changing the model architecture.
After that, we should be able to use the existing scripts. Because training is slower with BERT-Large, this should take (very rough estimate) about 5 to 6 days on 4 GPUs. If we use the English Wikipedia and do not reintroduce `StreamingDataSilo`, these code changes should be achievable in about one day.