chinmayjog13 closed this issue 1 year ago
Thanks. We follow LeViT (see the "No classification token" paragraph in its Model section) and use separate linear layers for the regular loss and the distillation loss. We did not run an ablation comparing a single layer against two separate layers, so we are unsure whether two layers offer any advantage. Experiments would be needed to investigate this.
thank you
I noticed there are two linear layers in the classifier: one feeds the regular loss and the other feeds the distillation loss. What is the advantage of having two separate layers and averaging their outputs during inference? Why not use a single layer for both losses?
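
For readers following along, here is a minimal sketch of the two-head setup described in the question. It is not the repository's code: the class name `TwoHeadClassifier`, the attribute names `head` and `dist_head`, the random initialization, and the omission of bias terms are all assumptions made for illustration, and it uses plain Python lists rather than a tensor library.

```python
import random


class TwoHeadClassifier:
    """Hypothetical two-head classifier (illustrative names, no biases)."""

    def __init__(self, dim, num_classes, seed=0):
        rng = random.Random(seed)
        # One weight matrix per head, each of shape (num_classes, dim).
        self.head = [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
                     for _ in range(num_classes)]
        self.dist_head = [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
                          for _ in range(num_classes)]

    @staticmethod
    def _linear(weights, features):
        # Plain matrix-vector product: one logit per output class.
        return [sum(w * x for w, x in zip(row, features)) for row in weights]

    def forward(self, features, training):
        cls_logits = self._linear(self.head, features)
        dist_logits = self._linear(self.dist_head, features)
        if training:
            # Training: return both outputs so that the regular loss and
            # the distillation loss can each use its own head.
            return cls_logits, dist_logits
        # Inference: average the two heads' outputs.
        return [(a + b) / 2 for a, b in zip(cls_logits, dist_logits)]


model = TwoHeadClassifier(dim=4, num_classes=3)
features = [1.0, 2.0, 3.0, 4.0]
cls_logits, dist_logits = model.forward(features, training=True)
averaged = model.forward(features, training=False)
```

Note that in this sketch both heads are linear maps over the same features, so the averaged inference output equals that of a single linear layer whose weights are the mean of the two heads' weights. Any difference between the one-layer and two-layer designs would therefore have to come from training, which is the part the maintainers say has not been ablated.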