HobbitLong / RepDistiller

[ICLR 2020] Contrastive Representation Distillation (CRD), and benchmark of recent knowledge distillation methods

About the CE loss #33

Open XiXiRuPan opened 3 years ago

XiXiRuPan commented 3 years ago

Thanks for sharing your code. Do all of the distillation experiments train with the CE loss? I have a question about this training strategy. First, the well-trained teacher model has its parameters fixed; then a linear dimension-transfer layer is added on top of the penultimate-layer features of the teacher and student models respectively, and this linear transfer layer is trainable along with the student model. But the CE loss is applied after the original teacher/student models' last classification layer, so it has no relationship with the linear dimension-transfer layer. This seems a little strange to me: the linear transfer layer has no connection to the final classification task, so how does it learn? My second question: if my student and teacher models' penultimate layers already have the same dimension, can I drop the linear dimension-transfer layer? Thanks very much for your reply.
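For what it's worth, here is how I currently understand the gradient flow being asked about, as a minimal sketch rather than the repo's actual training loop: the CE loss only sees the classifier logits, while the linear transfer (embedding) layers sit on a separate branch and are trained purely by the contrastive term. The dimensions, variable names, and the simplified cosine-style contrastive term below are my own placeholders, not the repo's CRD/NCE implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical sizes for illustration only (not the repo's config).
feat_t_dim, feat_s_dim, embed_dim, num_classes, batch = 256, 128, 128, 100, 8

# Linear "dimension transfer" (embedding) layers, one per network.
# They branch off the penultimate features and never touch the logits.
embed_t = nn.Linear(feat_t_dim, embed_dim)
embed_s = nn.Linear(feat_s_dim, embed_dim)

# Stand-ins for features/logits from a frozen teacher and a trainable student.
feat_t = torch.randn(batch, feat_t_dim).detach()              # teacher penultimate features (frozen)
feat_s = torch.randn(batch, feat_s_dim, requires_grad=True)   # student penultimate features
logit_s = torch.randn(batch, num_classes, requires_grad=True) # student classifier output
target = torch.randint(0, num_classes, (batch,))

# CE loss uses only the student's logits; the embedding layers are not involved.
loss_ce = F.cross_entropy(logit_s, target)

# Simplified contrastive-style term on the embedded, L2-normalized features.
# This is the part that produces gradients for embed_s and embed_t.
z_t = F.normalize(embed_t(feat_t), dim=1)
z_s = F.normalize(embed_s(feat_s), dim=1)
loss_contrast = -(z_s * z_t).sum(dim=1).mean()

loss = loss_ce + loss_contrast
loss.backward()
print(embed_s.weight.grad is not None)  # True: the transfer layer learns via the contrastive term
```

So, if this reading is right, the linear transfer layer does not need any connection to the classification head to receive gradients; the contrastive loss alone updates it, and the CE loss only shapes the student's classifier path.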