HobbitLong / RepDistiller

[ICLR 2020] Contrastive Representation Distillation (CRD), and benchmark of recent knowledge distillation methods

About the CE loss #33

Open XiXiRuPan opened 3 years ago

XiXiRuPan commented 3 years ago

Thanks for sharing your code. Do all of the distillation experiments train with the CE loss? I have a question about this training strategy. First, the well-trained teacher model has its parameters fixed; then a linear dimension-transfer layer is added on top of the penultimate-layer features of the teacher and student models respectively, and this linear transfer layer is trainable along with the student model. But the CE loss is applied after the original teacher/student models' last classification layer, so it has no relationship with the linear dimension-transfer layer. This seems a little strange to me: the linear transfer layer has no connection to the final classification task, so how does it learn? My second question: if my student and teacher models' penultimate layers already have the same dimension, can I drop the linear dimension-transfer layer? Thanks very much for your reply.
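For what it's worth, here is how I currently understand the gradient flow being asked about, as a minimal sketch rather than the repo's actual training loop: the CE loss only sees the classifier logits, while the linear transfer (embedding) layers sit on a separate branch and are trained purely by the contrastive term. The dimensions, variable names, and the simplified cosine-style contrastive term below are my own placeholders, not the repo's CRD/NCE implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical sizes for illustration only (not the repo's config).
feat_t_dim, feat_s_dim, embed_dim, num_classes, batch = 256, 128, 128, 100, 8

# Linear "dimension transfer" (embedding) layers, one per network.
# They branch off the penultimate features and never touch the logits.
embed_t = nn.Linear(feat_t_dim, embed_dim)
embed_s = nn.Linear(feat_s_dim, embed_dim)

# Stand-ins for features/logits from a frozen teacher and a trainable student.
feat_t = torch.randn(batch, feat_t_dim).detach()              # teacher penultimate features (frozen)
feat_s = torch.randn(batch, feat_s_dim, requires_grad=True)   # student penultimate features
logit_s = torch.randn(batch, num_classes, requires_grad=True) # student classifier output
target = torch.randint(0, num_classes, (batch,))

# CE loss uses only the student's logits; the embedding layers are not involved.
loss_ce = F.cross_entropy(logit_s, target)

# Simplified contrastive-style term on the embedded, L2-normalized features.
# This is the part that produces gradients for embed_s and embed_t.
z_t = F.normalize(embed_t(feat_t), dim=1)
z_s = F.normalize(embed_s(feat_s), dim=1)
loss_contrast = -(z_s * z_t).sum(dim=1).mean()

loss = loss_ce + loss_contrast
loss.backward()
print(embed_s.weight.grad is not None)  # True: the transfer layer learns via the contrastive term
```

So, if this reading is right, the linear transfer layer does not need any connection to the classification head to receive gradients; the contrastive loss alone updates it, and the CE loss only shapes the student's classifier path.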