Hi, thanks for publishing the code. It is really interesting work!

I have a question about the self-distillation used in the pre-training stage. It looks like the teacher model is fixed at the beginning and its knowledge is distilled into a new student model over the iterations. So it is not really sequential (i.e., the previous student becoming the teacher in the next iteration)?

Hello, thank you for your interest in our work. We adopt the scheme of Good-Embed [37] for self-distillation: knowledge is distilled from a trained model (the teacher) into a new model (the student), and the teacher is kept fixed throughout the process. In addition, the teacher and the student share the same network architecture and training data.
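For readers less familiar with this setup, here is a minimal sketch of fixed-teacher self-distillation under the assumptions described above (same architecture and data for teacher and student, teacher frozen). It is not the repository's actual training code; the helper names, temperature `T`, and loss weighting `alpha` are illustrative choices, and it distills logits with a standard CE + KL objective:

```python
import copy
import torch
import torch.nn.functional as F

def build_student_and_teacher(trained_model):
    """Teacher is a frozen copy of the trained model; student uses the same architecture."""
    teacher = copy.deepcopy(trained_model)
    teacher.eval()
    for p in teacher.parameters():
        p.requires_grad = False              # teacher stays fixed during distillation
    student = copy.deepcopy(trained_model)   # same architecture (could also be re-initialized)
    return student, teacher

def distillation_step(student, teacher, x, y, optimizer, T=4.0, alpha=0.5):
    """One step: supervised cross-entropy plus KL to the fixed teacher's soft targets."""
    student.train()
    logits_s = student(x)
    with torch.no_grad():                    # no gradients flow into the teacher
        logits_t = teacher(x)
    ce = F.cross_entropy(logits_s, y)
    kl = F.kl_div(
        F.log_softmax(logits_s / T, dim=1),
        F.softmax(logits_t / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    loss = alpha * ce + (1.0 - alpha) * kl
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Since the teacher is never replaced by the newly trained student, this corresponds to a single distillation generation rather than the sequential scheme described in the question.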