deepset-ai / haystack

:mag: AI orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your data. With advanced retrieval methods, it's best suited for building RAG, question answering, semantic search or conversational agent chatbots.
https://haystack.deepset.ai
Apache License 2.0

Implementing distillation loss functions from TinyBERT #1873

Closed · MichelBartels closed this issue 2 years ago

MichelBartels commented 2 years ago

Is your feature request related to a problem? Please describe. A basic version of model distillation was implemented with #1758. However, there is still room for improvement. The TinyBERT paper (https://arxiv.org/pdf/1909.10351.pdf) details an approach for fine-tuning an already pretrained small language model by distilling knowledge from a larger teacher.
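For reference, the two intermediate-layer losses from the TinyBERT paper (an MSE over attention matrices and an MSE over hidden states, with a learned projection mapping the student's hidden size to the teacher's) could be sketched roughly as below. This is only an illustrative PyTorch sketch, not the Haystack implementation; the class/function names and tensor shapes are assumptions made here for clarity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class HiddenStateLoss(nn.Module):
    """TinyBERT-style hidden-state distillation loss (illustrative sketch)."""

    def __init__(self, student_dim: int, teacher_dim: int):
        super().__init__()
        # Learned projection W_h so student hidden states (d') can be compared
        # with teacher hidden states (d), as in the paper's hidden-state loss.
        self.proj = nn.Linear(student_dim, teacher_dim, bias=False)

    def forward(self, student_hidden: torch.Tensor, teacher_hidden: torch.Tensor) -> torch.Tensor:
        # student_hidden: (batch, seq_len, d'), teacher_hidden: (batch, seq_len, d)
        return F.mse_loss(self.proj(student_hidden), teacher_hidden)


def attention_loss(student_attn: torch.Tensor, teacher_attn: torch.Tensor) -> torch.Tensor:
    # student_attn / teacher_attn: (batch, heads, seq_len, seq_len) attention scores;
    # TinyBERT matches the unnormalised scores rather than the softmax output.
    return F.mse_loss(student_attn, teacher_attn)
```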

Describe the solution you'd like The distillation loss functions in the TinyBERT paper should be usable when distilling a model in Haystack using the distil_from method.
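The paper's prediction-layer loss, a soft cross-entropy between the teacher's and the student's logits softened by a temperature, might look roughly like the sketch below. The function name and arguments are illustrative assumptions and are not tied to the distil_from signature.

```python
import torch
import torch.nn.functional as F


def prediction_layer_loss(student_logits: torch.Tensor,
                          teacher_logits: torch.Tensor,
                          temperature: float = 1.0) -> torch.Tensor:
    # Soft cross-entropy: -sum_i softmax(z_T / t)_i * log_softmax(z_S / t)_i,
    # averaged over the batch.
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    return -(teacher_probs * student_log_probs).sum(dim=-1).mean()
```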

Describe alternatives you've considered
- https://arxiv.org/pdf/1910.08381.pdf: seems to depend too heavily on expensive retraining and appears too task-specific.
- https://arxiv.org/pdf/2002.10957.pdf, https://arxiv.org/pdf/1910.01108.pdf: seem to focus only on pretraining.

Additional context This is the first of two issues for implementing fine-tuning as described in the TinyBERT paper. This issue focuses on the loss functions; the second focuses on data augmentation.

julian-risch commented 2 years ago

Closed by https://github.com/deepset-ai/haystack/pull/1879