When I read the code, I noticed that when building the student network, the VisionTransformer is built with the default img_size (224) instead of the student's actual img_size (96). As a result, patch_embed reports the same num_patches (196) as the teacher network rather than the real value (36). In other words, self.pos_embed in VisionTransformer has a fixed shape of (1, 197, 768) for both the student and the teacher network, which seems unreasonable to me. I do see that in the function interpolate_pos_encoding, patch_pos_embed is interpolated and concatenated with class_pos_embed to match the actual size (1, 37, 768). But wouldn't it be more reasonable to set the correct img_size directly?
Please correct me if I am wrong. I would appreciate your help.
I see. The student network processes not only the local crops (96) but also the global crops (224). That's why we cannot set a fixed img_size when initializing the student network.
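For anyone else reading this issue, here is a minimal sketch of how this kind of positional-embedding interpolation works. It assumes square inputs and a square patch grid, and it is a simplification of the repo's actual interpolate_pos_encoding (which also handles non-square crops and adds a small numerical offset before interpolating); the function name and shapes below are for illustration only:

```python
import torch
import torch.nn.functional as F

def interpolate_pos_encoding(pos_embed, img_size, patch_size=16):
    """Resize a (1, 1 + N, dim) positional embedding so it matches an
    img_size x img_size input. Simplified sketch: square inputs only."""
    num_patches = (img_size // patch_size) ** 2   # patches for this input
    n = pos_embed.shape[1] - 1                    # patches the embedding was built for
    if num_patches == n:
        return pos_embed
    class_pos_embed = pos_embed[:, :1]            # (1, 1, dim) CLS position, kept as-is
    patch_pos_embed = pos_embed[:, 1:]            # (1, n, dim) patch positions
    dim = pos_embed.shape[-1]
    side = int(n ** 0.5)                          # original grid side, e.g. 14 for 224px
    new_side = img_size // patch_size             # target grid side, e.g. 6 for 96px
    # reshape the flat patch embeddings into a 2-D grid and resize it bicubically
    patch_pos_embed = patch_pos_embed.reshape(1, side, side, dim).permute(0, 3, 1, 2)
    patch_pos_embed = F.interpolate(
        patch_pos_embed, size=(new_side, new_side), mode="bicubic", align_corners=False
    )
    patch_pos_embed = patch_pos_embed.permute(0, 2, 3, 1).reshape(1, -1, dim)
    return torch.cat([class_pos_embed, patch_pos_embed], dim=1)

pos_embed = torch.zeros(1, 197, 768)  # embedding sized for 224px, 16x16 patches
print(interpolate_pos_encoding(pos_embed, 96).shape)   # -> torch.Size([1, 37, 768])
print(interpolate_pos_encoding(pos_embed, 224).shape)  # -> torch.Size([1, 197, 768])
```

Because the same student forward pass sees both 96px and 224px crops, the interpolation path is what lets a single fixed-size pos_embed serve every crop size, which is why hard-coding img_size=96 would not work here.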
Thank you for sharing the great project.