OpenDriveLab / maskalign

[CVPR 2023] Official repository for paper "Stare at What You See: Masked Image Modeling without Reconstruction"
Apache License 2.0

Intuition behind MaskAlign #4

Closed ywyue closed 6 months ago

ywyue commented 6 months ago

Hi authors, nice work! In tab. 4, you conducted an ablation study of mask ratio and strategy. It was stated in the paper that "For alignment on the visible features, even a 0% mask ratio gains improvement compared with the teacher model". I found it interesting but also have some questions. In my understanding, with a mask ratio of 0%, the setup degenerates into feature distillation plus your proposed "Dynamic Alignment". My questions:

  1. With a 0% mask ratio, the input to the student model contains the same information as the teacher's, so there should be no misalignment issue. Do you have any intuition for why your proposed "Dynamic Alignment" helps the student model surpass the teacher model?
  2. Did you try a 0% mask ratio without "Dynamic Alignment", i.e., a setup of pure feature distillation? In that case, would the student model still be better than the teacher model?

Thanks in advance!

HellwayXue commented 6 months ago

Hi, when the mask ratio is zero, the whole pipeline can be regarded as feature distillation similar to [1] (note that both MaskAlign and [1] apply a whitening operation on the teacher features), and [1] shows that distillation of this kind can improve the teacher's otherwise inferior fine-tuning performance.
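
To make the 0%-mask case concrete, here is a minimal sketch of what the pipeline reduces to, assuming the whitening step is implemented as a LayerNorm without learnable affine parameters (as in [1]); the names `student_feat`/`teacher_feat` and the choice of smooth-L1 loss are illustrative, not necessarily the exact ones used in MaskAlign:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_feat, teacher_feat):
    # Whitening: per-token normalization of the teacher target
    # (LayerNorm with no learnable affine parameters).
    whitened = F.layer_norm(teacher_feat, teacher_feat.shape[-1:])
    # Regress the student feature onto the whitened teacher feature.
    return F.smooth_l1_loss(student_feat, whitened)

# Usage with dummy features of shape (batch, tokens, dim):
s = torch.randn(2, 196, 768, requires_grad=True)
t = torch.randn(2, 196, 768)
loss = distillation_loss(s, t)
loss.backward()
```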

"Dynamic Alignment" actually acts as an adaption module to leverage multi-level features of the teacher model. So it naturally performs better than distillation only on the last layer's output.

Hope this helps; feel free to discuss further.

[1] Wei, Yixuan, et al. "Contrastive learning rivals masked image modeling in fine-tuning via feature distillation." arXiv preprint arXiv:2205.14141 (2022).

ywyue commented 6 months ago

Thanks, it makes sense!