RuipingL / TransKD


About the experiment setting (pre-trained weights) #2

Closed · botaoye closed this issue 2 years ago

botaoye commented 2 years ago

Hi, thanks for your work. I noticed that you did not use pre-trained weights for the student model (ImageNet-1K pre-trained, as I understand it). In Tab. 2, the performance after distillation is worse than the performance obtained when pre-trained weights are used. So I'm curious why you didn't use the ImageNet-1K pre-trained weights for the student model, since that is the more common setting.

RuipingL commented 1 year ago

Thank you for your question. Because Transformers lack the inductive bias of CNNs, they depend on a time-consuming pre-training process on a large-scale dataset. Replacing that pre-training process becomes meaningful and feasible once an efficient model without a cumbersome counterpart, such as MobileNet, is proposed. Besides, this setting, knowledge distillation with a pre-trained teacher and a non-pre-trained student, is widely used in Transformer-based knowledge distillation in NLP, e.g. TinyBERT. Although we also achieve a clear improvement over our baseline Knowledge Review when the student is pre-trained, we think it is more meaningful to replace the time-consuming pre-training process.
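For readers unfamiliar with the setting being discussed, here is a minimal sketch of distillation with a pre-trained teacher and a randomly initialized (non-pre-trained) student. The model factory `build_segformer`, the variant names, the loss weighting, and the hyperparameters are hypothetical placeholders for illustration, not the TransKD implementation.

```python
import torch
import torch.nn.functional as F

def build_segformer(variant: str, pretrained: bool) -> torch.nn.Module:
    """Hypothetical model factory; replace with your own backbone builder."""
    raise NotImplementedError

teacher = build_segformer("mit_b2", pretrained=True)   # teacher: ImageNet-1K pre-trained
student = build_segformer("mit_b0", pretrained=False)  # student: random init, no pre-training
teacher.eval()

optimizer = torch.optim.AdamW(student.parameters(), lr=6e-5)

def distillation_step(images, labels, alpha=0.5, temperature=4.0):
    # Teacher predictions are used as soft targets; no gradients flow to it.
    with torch.no_grad():
        t_logits = teacher(images)
    s_logits = student(images)

    # Supervised segmentation loss on the ground-truth labels.
    task_loss = F.cross_entropy(s_logits, labels, ignore_index=255)

    # Soft-label KD loss between teacher and student predictions.
    kd_loss = F.kl_div(
        F.log_softmax(s_logits / temperature, dim=1),
        F.softmax(t_logits / temperature, dim=1),
        reduction="batchmean",
    ) * temperature ** 2

    loss = (1 - alpha) * task_loss + alpha * kd_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```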