The core argument of your article is that, after feature distillation, the pre-trained model exhibits properties similar to those of a MIM model. The question is: where do these properties come from?
For instance, MIM models acquire locality through the masked modeling mechanism. In contrast, feature distillation feeds the same augmentation view to both the teacher and the student. So it is unclear where these properties originate.