Open Hongbo-Z opened 2 months ago
I think that's because the student model only has a 'short-term memory': its weights mostly reflect the most recent batches. The EMA makes the teacher a running average of the student over many steps, so it effectively takes a lot more images into account (long-term memory) and can map images to a more uniform distribution than the student.
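To make the 'long-term memory' intuition a bit more concrete, here is a toy sketch in plain Python. It is not the repo's code and not a proof: the names `ema_update` and `simulate` are made up for illustration, and the "student" is just a single number that jitters around a fixed target to mimic per-batch noise. Under those assumptions, the EMA copy averages over roughly 1 / (1 - momentum) recent steps and ends up closer to the target than the student does.

```python
import random


def ema_update(teacher, student, momentum=0.996):
    """One EMA step: teacher <- momentum * teacher + (1 - momentum) * student."""
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]


def simulate(steps=2000, momentum=0.99, noise=1.0, seed=0):
    """Toy model (assumption): the student weight is target + per-batch noise.

    Returns the mean squared error of the student and of the EMA teacher,
    measured over the second half of the run so the teacher's initial value
    has been forgotten.
    """
    rng = random.Random(seed)
    target = 1.0
    teacher = [0.0]
    student_err = 0.0
    teacher_err = 0.0
    for step in range(steps):
        # The student only "remembers" the current batch.
        student = [target + rng.gauss(0.0, noise)]
        # The teacher blends in the new student weights slowly.
        teacher = ema_update(teacher, student, momentum)
        if step >= steps // 2:
            student_err += (student[0] - target) ** 2
            teacher_err += (teacher[0] - target) ** 2
    n = steps - steps // 2
    return student_err / n, teacher_err / n


if __name__ == "__main__":
    student_mse, teacher_mse = simulate()
    print(f"student MSE: {student_mse:.4f}")
    print(f"teacher MSE: {teacher_mse:.4f}")
```

In this toy setting the teacher's error is much smaller because averaging cancels the per-batch noise. Real training is different in that the student also improves over time, so this only illustrates the averaging effect, not the full dynamics.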
Thank you for sharing the great project.
I noticed that in the paper, the authors say, 'We observe that this teacher has better performance than the student throughout the training, and hence, guides the training of the student by providing target features of higher quality'.
But I couldn't find a theoretical explanation for why this happens.
Also, how can we observe this during SSL training? And which metric is used to evaluate the performance?