Closed: MichelBartels closed this issue 2 years ago
@MichelBartels Can you please post the results here and close the issue?
Yes, here are the results:

- Distillation with BERT-base as teacher: EM 71.6%, F1 76.2%
- Distillation with BERT-large as teacher: EM 71.2%, F1 76.4%

Given these results, using large teachers does not seem promising. On the upside, the distillation process seems to be very stable, as the prediction quality is nearly identical across teachers.
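For context, the comparison above concerns soft-label distillation, where the student is trained to match the teacher's output distribution. A minimal sketch of that objective in PyTorch (the toy logits and function name are illustrative, not the actual Haystack training code):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between temperature-softened student and teacher
    distributions, scaled by T^2 as in Hinton et al. (2015)."""
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2

# Toy batch of 2 examples with 4 classes each (fixed, illustrative logits)
student = torch.tensor([[1.0, 0.5, -0.5, 0.0], [0.2, 0.1, 0.0, -0.1]])
teacher = torch.tensor([[1.2, 0.4, -0.6, 0.1], [0.3, 0.0, 0.1, -0.2]])
loss = distillation_loss(student, teacher)
```

Because only the teacher's soft targets change between the base and large runs, very similar EM/F1 for both suggests the objective itself is robust to the choice of teacher.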
As a next step towards distilling better language models, we want to explore the difference between distilling from a base model and distilling from a large model.
For this, we would need to decide on a dataset. We also need to consider changing the model architecture.
After that, we should be able to use the existing scripts. Because training is slower with BERT-Large, this should take (very rough estimate) about 5 to 6 days on 4 GPUs. If we use the English Wikipedia and do not reintroduce `StreamingDataSilo`, these code changes should be achievable in about one day.