RuipingL / TransKD


About the experiment setting (pre-trained weights) #2

Closed · botaoye closed this issue 2 years ago

botaoye commented 2 years ago

Hi, thanks for your work. I noticed that you did not use pre-trained weights for the student model (ImageNet-1K pre-trained, as I understand it). In Tab. 2, the performance after distillation is worse than the performance obtained when pre-trained weights are used. So I'm curious why you didn't use the ImageNet-1K pre-trained weights for the student model, since that is the more common setting.

RuipingL commented 1 year ago

Thank you for your question. Because Transformers lack the inductive bias of CNNs, they depend on a time-consuming pre-training process on a large-scale dataset. Replacing that pre-training process becomes meaningful and feasible once an efficient model without a cumbersome counterpart, such as MobileNet, is proposed. Besides, this setting, knowledge distillation with a pre-trained teacher and a non-pre-trained student, is widely used in Transformer-based knowledge distillation in NLP, e.g. TinyBERT. Although we also achieve a clear improvement over our baseline Knowledge Review when the student is pre-trained, we think it is more meaningful to replace the time-consuming pre-training process.
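For readers unfamiliar with the setting being discussed, here is a minimal sketch of distillation with a pre-trained teacher and a randomly initialized (non-pre-trained) student. The model factory `build_segformer`, the variant names, the loss weighting, and the hyperparameters are hypothetical placeholders for illustration, not the TransKD implementation.

```python
import torch
import torch.nn.functional as F

def build_segformer(variant: str, pretrained: bool) -> torch.nn.Module:
    """Hypothetical model factory; replace with your own backbone builder."""
    raise NotImplementedError

teacher = build_segformer("mit_b2", pretrained=True)   # teacher: ImageNet-1K pre-trained
student = build_segformer("mit_b0", pretrained=False)  # student: random init, no pre-training
teacher.eval()

optimizer = torch.optim.AdamW(student.parameters(), lr=6e-5)

def distillation_step(images, labels, alpha=0.5, temperature=4.0):
    # Teacher predictions are used as soft targets; no gradients flow to it.
    with torch.no_grad():
        t_logits = teacher(images)
    s_logits = student(images)

    # Supervised segmentation loss on the ground-truth labels.
    task_loss = F.cross_entropy(s_logits, labels, ignore_index=255)

    # Soft-label KD loss between teacher and student predictions.
    kd_loss = F.kl_div(
        F.log_softmax(s_logits / temperature, dim=1),
        F.softmax(t_logits / temperature, dim=1),
        reduction="batchmean",
    ) * temperature ** 2

    loss = (1 - alpha) * task_loss + alpha * kd_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```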