deepset-ai / haystack

:mag: AI orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your data. With advanced retrieval methods, it's best suited for building RAG, question answering, semantic search or conversational agent chatbots.
https://haystack.deepset.ai
Apache License 2.0

Distilling BERT-Large #2019

Closed MichelBartels closed 2 years ago

MichelBartels commented 2 years ago

As a next step towards distilling better language models, we want to explore the difference between distilling from a base model and distilling from a large model.

For this, we would need to decide on a dataset.

We also need to consider changing the model architecture.

After that, we should be able to use the existing scripts. Because training is slower with BERT-Large, this should take (very rough estimate) about 5 to 6 days on 4 GPUs. If we use the English Wikipedia and do not reintroduce StreamingDataSilo, these code changes should be achievable in about one day.
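For context on what such a run boils down to at the prediction layer, here is a minimal, illustrative sketch of soft-label distillation for extractive QA in plain PyTorch/transformers. It is not the existing Haystack scripts: the checkpoint names, `temperature`, `alpha`, and the assumed `batch` layout are hypothetical stand-ins. Comparing a base against a large teacher only requires swapping the teacher checkpoint.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForQuestionAnswering

# Hypothetical checkpoints: any SQuAD-style teacher/student pair that shares a
# tokenizer works; swapping the teacher between a base and a large model is
# the comparison discussed in this issue.
teacher = AutoModelForQuestionAnswering.from_pretrained(
    "deepset/bert-large-uncased-whole-word-masking-squad2"
)
student = AutoModelForQuestionAnswering.from_pretrained("bert-base-uncased")
teacher.eval()


def distillation_loss(batch, temperature=2.0, alpha=0.5):
    """Blend the student's hard-label QA loss with a soft-label term from the teacher.

    `batch` is assumed to come from a SQuAD-style DataLoader with
    input_ids, attention_mask, start_positions and end_positions.
    """
    with torch.no_grad():  # the teacher only provides targets, no gradients needed
        t_out = teacher(
            input_ids=batch["input_ids"], attention_mask=batch["attention_mask"]
        )
    s_out = student(
        input_ids=batch["input_ids"],
        attention_mask=batch["attention_mask"],
        start_positions=batch["start_positions"],  # yields the hard-label loss
        end_positions=batch["end_positions"],
    )

    # KL divergence between temperature-softened start/end span distributions
    kl = 0.0
    for s_logits, t_logits in (
        (s_out.start_logits, t_out.start_logits),
        (s_out.end_logits, t_out.end_logits),
    ):
        kl = kl + F.kl_div(
            F.log_softmax(s_logits / temperature, dim=-1),
            F.softmax(t_logits / temperature, dim=-1),
            reduction="batchmean",
        ) * temperature ** 2

    return alpha * s_out.loss + (1 - alpha) * kl
```

With a large teacher, the extra forward pass inside the no-grad block is what dominates the additional runtime, which is where the rough 5-to-6-day estimate above comes from.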

tholor commented 2 years ago

@MichelBartels Can you please post the results here and close the issue?

MichelBartels commented 2 years ago

Yes, here are the results:

- Distillation with BERT-base as teacher: EM 71.6%, F1 76.2%
- Distillation with BERT-large as teacher: EM 71.2%, F1 76.4%

Given these results, using a large teacher does not seem promising. On the upside, the distillation process appears to be very stable, since the prediction quality is nearly identical for both teachers.