Why does the teacher network perform better than the student during training?

facebookresearch / dino

PyTorch code for Vision Transformers training with the Self-Supervised learning method DINO

Apache License 2.0

6.06k stars, 885 forks

Why does the teacher network perform better than the student during training? #274

Open Hongbo-Z opened 2 months ago

Hongbo-Z commented 2 months ago

Thank you for sharing the great project.

I noticed that the paper says: 'We observe that this teacher has better performance than the student throughout the training, and hence, guides the training of the student by providing target features of higher quality'.

But I couldn't find a theoretical explanation for why this happens.

Also, how can we observe this during SSL training? And which metric is used to evaluate the performance?

wangh09 commented 3 hours ago

I think that's because the student model only has a 'short-term memory'. EMA encourages the teacher model to take a lot more images into account (long-term memory) so that it can map images to a more uniform distribution compared to the student.
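To make the 'long-term memory' intuition concrete, here is a toy sketch of the EMA update, teacher ← m · teacher + (1 − m) · student. It is not the DINO training loop: the "student" is modeled as a single noisy parameter fluctuating around a fixed target, and `simulate`, `target`, and `noise` are made-up names for illustration. Under that assumption, the teacher averages over many past student states, so its error should come out much smaller than the error of any single student step.

```python
import random


def ema_update(teacher, student, momentum):
    """Element-wise EMA: teacher <- momentum * teacher + (1 - momentum) * student."""
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]


def simulate(steps=2000, momentum=0.99, noise=1.0, seed=0):
    """Toy model (an assumption, not DINO): the student is a noisy estimate of a fixed target."""
    rng = random.Random(seed)
    target = 3.0       # hypothetical "ideal" parameter value
    teacher = [0.0]    # teacher starts far from the target
    tail = steps // 2  # measure only the second half, after the teacher has warmed up
    student_sq_err = 0.0
    teacher_sq_err = 0.0
    for step in range(steps):
        # Short-term view: each student state reflects only the latest noisy step.
        student = [target + rng.gauss(0.0, noise)]
        # Long-term view: the teacher blends in the whole history of student states.
        teacher = ema_update(teacher, student, momentum)
        if step >= steps - tail:
            student_sq_err += (student[0] - target) ** 2
            teacher_sq_err += (teacher[0] - target) ** 2
    return student_sq_err / tail, teacher_sq_err / tail


if __name__ == "__main__":
    student_mse, teacher_mse = simulate()
    print(f"student MSE: {student_mse:.4f}")
    print(f"teacher MSE: {teacher_mse:.4f}")
```

In this toy setting the gap comes purely from averaging out step-to-step noise. Whether the same mechanism fully explains the teacher's advantage in DINO is not established by this sketch.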