facebookresearch / dino

PyTorch code for Vision Transformers training with the Self-Supervised learning method DINO
Apache License 2.0

Are the pretrained weights from the student or teacher model? #44

Closed bfialkoff closed 3 years ago

bfialkoff commented 3 years ago

I don't understand which of the two models is later used for inference: is it the student or the teacher? Are the pretrained weights provided from the teacher or from the student network?

woctezuma commented 3 years ago

> I don't understand which of the two models is later used for inference: is it the student or the teacher?

~~The goal is to train a student. Same as in real life. The teacher is only an expendable means towards that goal.~~ Edit: See the answer by the first author below!

Student

> Are the pretrained weights provided from the teacher or from the student network?

Everything is provided.

Everything

bfialkoff commented 3 years ago

Thanks for the clarification. I guess what I meant was: in the video_generation script, when we load a model, are we loading the student or the backbone? Does "backbone" refer to the base model, and "head" to the part of the architecture that turns it into the student model?

mathildecaron31 commented 3 years ago

Hi @bfialkoff

The weights in the backbone-only files are from the teacher, and the results in our paper are obtained with the teacher weights as well. We indeed show in the paper that the teacher generally performs better than the student.

Therefore, the video_generation script loads the teacher weights (though the visualizations are nearly the same if you use the student weights in that case).

For any of our evaluation scripts, if you want to evaluate the student weights instead, you can do so by passing the path to the full checkpoint with the --pretrained_weights argument and specifying --checkpoint_key student.
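
As a rough illustration of what that option amounts to, here is a minimal PyTorch sketch of loading the student weights by hand. The checkpoint path is hypothetical, and the layout (a dict with "student" and "teacher" state dicts whose keys still carry "module." and "backbone." prefixes from training) is inferred from the --checkpoint_key mechanism described above, so treat it as a sketch rather than the repo's exact loading code.

```python
import torch

# Build the architecture without pretrained weights (hub entrypoint from the DINO README).
model = torch.hub.load("facebookresearch/dino:main", "dino_vits16", pretrained=False)

# Hypothetical local path to a full training checkpoint.
ckpt = torch.load("full_checkpoint.pth", map_location="cpu")

# Pick the student state dict; use "teacher" for the weights reported in the paper.
state_dict = ckpt["student"]

# Strip the prefixes left over from DataParallel / the projection-head wrapper.
state_dict = {k.replace("module.", "").replace("backbone.", ""): v for k, v in state_dict.items()}

print(model.load_state_dict(state_dict, strict=False))
```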

Hope that helps

mathildecaron31 commented 3 years ago

@woctezuma thanks for helping to reply to this issue.

I have a minor remark. Our ultimate goal is to obtain the best possible model in an unsupervised way. We train the student with SGD, and the teacher is an exponential moving average (EMA) of that student. We found that the teacher performs better than the student, which is why our final model used for downstream tasks is the teacher.
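
To make the EMA relation concrete, here is a toy sketch (not the exact training loop in this repo) of the update applied after each SGD step on the student:

```python
import copy
import torch
import torch.nn as nn

# Toy illustration of "the teacher is an EMA of the student".
student = nn.Linear(8, 8)          # stand-in for the student network
teacher = copy.deepcopy(student)   # the teacher starts as a copy of the student
m = 0.996                          # EMA momentum (scheduled towards 1.0 during training)

# ... one SGD step on the student would happen here ...

# After the step, move the teacher's parameters towards the student's.
with torch.no_grad():
    for p_s, p_t in zip(student.parameters(), teacher.parameters()):
        p_t.mul_(m).add_((1 - m) * p_s)
```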

woctezuma commented 3 years ago

> @woctezuma thanks for helping to reply to this issue.
>
> I have a minor remark. Our ultimate goal is to obtain the best possible model in an unsupervised way. We train the student with SGD, and the teacher is an exponential moving average (EMA) of that student. We found that the teacher performs better than the student, which is why our final model used for downstream tasks is the teacher.

Oops, it looks like I was confused about that! Thanks for clearing that up!

Hopefully I have not confused others! Sorry about that, @bfialkoff!