@splinter21, could you please rewrite the description in English so that other readers can understand it? Thanks!
Generally, it's possible for the student to reach higher accuracy than the teacher model. One example is when the teacher model is identical to the student model: you can still expect the student to achieve higher accuracy. I have no theoretical analysis of this, but my guess is that it can be understood from either an ensemble perspective or a label-smoothing perspective.
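To make the label-smoothing view a bit more concrete, here is a minimal sketch in plain Python. It is not the code from this repo, and every name in it (`softmax`, `smoothed_label`, `distillation_loss`, `alpha`, `temperature`, `eps`) is an assumption chosen for illustration. It only shows that the teacher's softened output plays the same role as a smoothed label: both replace the one-hot target with a distribution that puts some mass on the other classes.

```python
import math

def softmax(logits, temperature=1.0):
    # Temperature-scaled softmax; subtract the max for numerical stability.
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target, probs):
    # Cross-entropy between a target distribution and predicted probabilities.
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)

def smoothed_label(true_class, num_classes, eps=0.1):
    # Uniform label smoothing: mix the one-hot label with a uniform distribution.
    return [
        (1.0 - eps) * (1.0 if i == true_class else 0.0) + eps / num_classes
        for i in range(num_classes)
    ]

def distillation_loss(student_logits, teacher_logits, true_class,
                      alpha=0.5, temperature=2.0):
    # Hypothetical weighting: (1 - alpha) on the hard label, alpha on the
    # teacher's soft target. The T^2 scaling some formulations use is omitted.
    num_classes = len(student_logits)
    hard = [1.0 if i == true_class else 0.0 for i in range(num_classes)]
    soft = softmax(teacher_logits, temperature)
    hard_loss = cross_entropy(hard, softmax(student_logits))
    soft_loss = cross_entropy(soft, softmax(student_logits, temperature))
    return (1.0 - alpha) * hard_loss + alpha * soft_loss

# Both targets are non-one-hot distributions over the same classes; the
# teacher's is input-dependent, the smoothed label is not.
teacher_target = softmax([4.0, 1.0, 0.5], temperature=2.0)
uniform_target = smoothed_label(true_class=0, num_classes=3, eps=0.1)
loss = distillation_loss([3.0, 1.5, 0.2], [4.0, 1.0, 0.5], true_class=0)
```

With `alpha=0` this reduces to ordinary cross-entropy on the hard label, so the soft-target term is the only place the teacher enters.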