Advantage of separate classifier layer for distillation?

THU-MIG / RepViT

RepViT: Revisiting Mobile CNN From ViT Perspective [CVPR 2024] and RepViT-SAM: Towards Real-Time Segmenting Anything

https://arxiv.org/abs/2307.09283

Apache License 2.0


Advantage of separate classifier layer for distillation? #8

Closed (chinmayjog13 closed this issue 1 year ago)

chinmayjog13 commented 1 year ago

I noticed the classifier has two linear layers: one whose output goes to the regular loss and one whose output is used in the distillation loss. What is the advantage of having two separate layers and averaging their outputs during inference? Why not use a single layer for both losses?
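For readers unfamiliar with the setup being asked about, here is a minimal sketch of a two-head classifier in plain Python (no framework, so it runs anywhere). It is not the RepViT code: the names `TwoHeadClassifier`, `head`, and `head_dist`, and the initialization range, are assumptions made for illustration. It only shows the behavior described in the question: two logit vectors during training, their average at inference.

```python
import random


def linear(x, weight, bias):
    """Compute y = W x + b for a single feature vector given as a list."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def init_linear(in_dim, out_dim, rng):
    """Hypothetical initializer: small uniform weights, zero bias."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
              for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias


class TwoHeadClassifier:
    """Illustrative classifier with a regular head and a distillation head."""

    def __init__(self, in_dim, num_classes, seed=0):
        rng = random.Random(seed)
        self.head = init_linear(in_dim, num_classes, rng)       # regular loss
        self.head_dist = init_linear(in_dim, num_classes, rng)  # distillation loss

    def forward(self, features, training):
        logits = linear(features, *self.head)
        logits_dist = linear(features, *self.head_dist)
        if training:
            # Each head's logits feed a different loss term.
            return logits, logits_dist
        # At inference the two predictions are averaged.
        return [(a + b) / 2 for a, b in zip(logits, logits_dist)]
```

With a single shared layer, `forward` would return one logit vector and both loss terms would be computed on it.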

jameslahm commented 1 year ago

Thanks. We follow LeViT (see the "No classification token" paragraph in the Model section of the LeViT paper) and use separate linear layers for the regular loss and the distillation loss, respectively. We didn't run ablation studies comparing a single layer with two separate layers, so we are unsure whether using two layers has an advantage. Experiments would be needed to investigate this.
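To make the "regular loss" and "distillation loss" concrete, the sketch below shows one way the two sets of logits could be used during training. It assumes a DeiT-style hard-label distillation objective with equal weights. The thread does not say which objective or weighting RepViT uses, and the function names here are hypothetical.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy of one logit vector against an integer class index."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


def hard_distillation_loss(logits, logits_dist, teacher_logits, label):
    """Assumed objective: 0.5 * CE(head, label) + 0.5 * CE(dist head, teacher argmax)."""
    # The teacher's most confident class acts as a hard pseudo-label.
    teacher_label = max(range(len(teacher_logits)),
                        key=teacher_logits.__getitem__)
    base_loss = cross_entropy(logits, label)                   # regular loss
    dist_loss = cross_entropy(logits_dist, teacher_label)      # distillation loss
    return 0.5 * base_loss + 0.5 * dist_loss
```

In the single-layer alternative raised in the question, the same logit vector would be passed as both `logits` and `logits_dist`. Whether that changes accuracy is exactly the ablation that has not been run.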

chinmayjog13 commented 1 year ago

Thank you.