facebookresearch / dino

PyTorch code for Vision Transformers training with the Self-Supervised learning method DINO
Apache License 2.0

Why not set the correct img_size when building the student network? #266

Closed LichunZhang closed 7 months ago

LichunZhang commented 7 months ago

Thank you for sharing the great project.

When reading the code, I noticed that when building the student network, the VisionTransformer is built with the default img_size (224) instead of the student's actual img_size (96). As a result, the patch_embed returns the same num_patches (196) as the teacher network rather than the real num_patches (36). In other words, self.pos_embed in VisionTransformer has a fixed shape (1, 197, 768) regardless of whether it is the student or the teacher network, which does not seem reasonable to me. I do see that in the function interpolate_pos_encoding, patch_pos_embed is interpolated and concatenated with class_pos_embed to match the actual size (1, 37, 768). Still, wouldn't directly setting the right img_size be the more reasonable approach?
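For reference, here is a minimal sketch of the interpolation step that interpolate_pos_encoding performs (this is a simplified standalone version, not the repo's exact code; the function name, patch_size default, and the use of size= instead of scale_factor= are my assumptions):

```python
import math
import torch
import torch.nn.functional as F

def interpolate_pos_encoding(pos_embed, w, h, patch_size=16):
    # pos_embed: (1, 1 + N, dim), one class token followed by N patch tokens
    N = pos_embed.shape[1] - 1
    dim = pos_embed.shape[-1]
    w0, h0 = w // patch_size, h // patch_size  # target patch grid
    if w0 * h0 == N:
        return pos_embed
    class_pos_embed = pos_embed[:, :1]
    patch_pos_embed = pos_embed[:, 1:]
    side = int(math.sqrt(N))  # original grid is square, e.g. 14x14 for 224/16
    # reshape to a 2D grid, resize bicubically, flatten back to a sequence
    patch_pos_embed = F.interpolate(
        patch_pos_embed.reshape(1, side, side, dim).permute(0, 3, 1, 2),
        size=(h0, w0),
        mode="bicubic",
        align_corners=False,
    )
    patch_pos_embed = patch_pos_embed.permute(0, 2, 3, 1).reshape(1, h0 * w0, dim)
    return torch.cat((class_pos_embed, patch_pos_embed), dim=1)

# The 224-sized pos_embed (1, 197, 768) adapted to a 96x96 crop -> (1, 37, 768)
pe = torch.zeros(1, 197, 768)
print(interpolate_pos_encoding(pe, 96, 96).shape)  # torch.Size([1, 37, 768])
```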

Please correct me if I am wrong. I would appreciate your help.

LichunZhang commented 7 months ago

I see. The student network processes not only the local crops (96) but also the global crops (224). That's why we cannot set a fixed img_size when initializing the student network.
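To make this concrete, here is a minimal sketch (a hypothetical stub, not the repo's VisionTransformer; the crop counts are the paper's defaults) showing why a single set of weights can consume both crop sizes: the convolutional patch embedding produces a variable-length token sequence, and only pos_embed, interpolated as above, depends on the input size.

```python
import torch
import torch.nn as nn

# A single conv turns any HxW divisible by 16 into a token sequence,
# so one student network can handle both 224 and 96 inputs.
patch_embed = nn.Conv2d(3, 768, kernel_size=16, stride=16)

def tokens(x):
    return patch_embed(x).flatten(2).transpose(1, 2)  # (B, num_patches, 768)

global_crop = torch.randn(4, 3, 224, 224)  # 2 global views per image in DINO
local_crop = torch.randn(4, 3, 96, 96)     # 8 local views per image by default

print(tokens(global_crop).shape)  # torch.Size([4, 196, 768])
print(tokens(local_crop).shape)   # torch.Size([4, 36, 768])
```

The teacher sees only the global views, while the student sees all views, so the student's pos_embed must adapt at forward time rather than being fixed at init.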