When I read the code, I noticed that when building the student network, the VisionTransformer is built with the default img_size (224) instead of the student's actual img_size (96). As a result, patch_embed reports the same num_patches (196) as the teacher network rather than the real value (36). In other words, self.pos_embed in VisionTransformer has a fixed shape of (1, 197, 768) for both the student and the teacher network, which seems unreasonable to me. I do see that in the function interpolate_pos_encoding, patch_pos_embed is interpolated and concatenated with class_pos_embed to match the actual size (1, 37, 768). But wouldn't it be more reasonable to set the correct img_size directly?
Please correct me if I am wrong. I would appreciate your help.
I see. The student network processes not only the local crops (96) but also the global crops (224). That's why we cannot set a fixed img_size when initializing the student network.
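For anyone else reading this issue, here is a minimal sketch of how this kind of positional-embedding interpolation works. It assumes square inputs and a square patch grid, and it is a simplification of the repo's actual interpolate_pos_encoding (which also handles non-square crops and adds a small numerical offset before interpolating); the function name and shapes below are for illustration only:

```python
import torch
import torch.nn.functional as F

def interpolate_pos_encoding(pos_embed, img_size, patch_size=16):
    """Resize a (1, 1 + N, dim) positional embedding so it matches an
    img_size x img_size input. Simplified sketch: square inputs only."""
    num_patches = (img_size // patch_size) ** 2   # patches for this input
    n = pos_embed.shape[1] - 1                    # patches the embedding was built for
    if num_patches == n:
        return pos_embed
    class_pos_embed = pos_embed[:, :1]            # (1, 1, dim) CLS position, kept as-is
    patch_pos_embed = pos_embed[:, 1:]            # (1, n, dim) patch positions
    dim = pos_embed.shape[-1]
    side = int(n ** 0.5)                          # original grid side, e.g. 14 for 224px
    new_side = img_size // patch_size             # target grid side, e.g. 6 for 96px
    # reshape the flat patch embeddings into a 2-D grid and resize it bicubically
    patch_pos_embed = patch_pos_embed.reshape(1, side, side, dim).permute(0, 3, 1, 2)
    patch_pos_embed = F.interpolate(
        patch_pos_embed, size=(new_side, new_side), mode="bicubic", align_corners=False
    )
    patch_pos_embed = patch_pos_embed.permute(0, 2, 3, 1).reshape(1, -1, dim)
    return torch.cat([class_pos_embed, patch_pos_embed], dim=1)

pos_embed = torch.zeros(1, 197, 768)  # embedding sized for 224px, 16x16 patches
print(interpolate_pos_encoding(pos_embed, 96).shape)   # -> torch.Size([1, 37, 768])
print(interpolate_pos_encoding(pos_embed, 224).shape)  # -> torch.Size([1, 197, 768])
```

Because the same student forward pass sees both 96px and 224px crops, the interpolation path is what lets a single fixed-size pos_embed serve every crop size, which is why hard-coding img_size=96 would not work here.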
Thank you for sharing the great project.