HobbitLong / RepDistiller

[ICLR 2020] Contrastive Representation Distillation (CRD), and benchmark of recent knowledge distillation methods

In the two result tables, with WRN-40-2 as the teacher, why do the distilled students (CRD+KD) reach higher accuracy than the teacher? #9

Closed: splinter21 closed this issue 4 years ago

HobbitLong commented 4 years ago

@splinter21 , could you please rephrase the description into English so that other readers can understand it? Thanks!

Generally, it's possible for the student to get higher accuracy than the teacher model. One example is when the teacher model is identical to the student model; even then you can expect the student to achieve higher accuracy. Though I have no theoretical analysis of it, my guess is that it can be understood from either the ensemble perspective or the label-smoothing perspective.
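
In case it helps, here is a minimal sketch (not the repo's exact code) of the vanilla KD loss that the "+KD" term refers to; the function name `kd_loss` and the values `T=4.0` and `alpha=0.9` are illustrative assumptions, not necessarily RepDistiller's defaults. The teacher's temperature-softened probabilities act like an input-dependent label-smoothing target for the student, which is one way to see the label-smoothing perspective mentioned above.

```python
# Minimal sketch of the standard KD loss (Hinton et al.), assumed form only,
# not copied from RepDistiller. Hyperparameters here are illustrative.
import torch
import torch.nn.functional as F


def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Weighted sum of a soft-target KL term and hard-label cross-entropy."""
    # Soft targets: KL divergence between temperature-softened distributions,
    # scaled by T^2 so gradient magnitudes stay comparable across temperatures.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: ordinary cross-entropy with the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard


if __name__ == "__main__":
    # Random tensors stand in for student/teacher outputs on a CIFAR-100-sized batch.
    student_logits = torch.randn(8, 100, requires_grad=True)
    teacher_logits = torch.randn(8, 100)
    labels = torch.randint(0, 100, (8,))
    loss = kd_loss(student_logits, teacher_logits, labels)
    loss.backward()
    print(loss.item())
```

With this form, the soft-target term pulls the student toward the teacher's full output distribution rather than a one-hot label, so wrong-class probabilities are smoothed in a data-dependent way; that is the intuition behind a distilled student sometimes exceeding its teacher.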