Closed: ywyue closed this issue 6 months ago.
Hi, when the masking ratio is zero, the whole pipeline can be regarded as feature distillation similar to [1] (note that both MaskAlign and [1] include the whitening operation), which shows that the inferior fine-tuning performance can be improved.
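To make the 0%-mask-ratio view concrete, here is a minimal sketch of feature distillation with a whitened target, in the spirit of [1]. This is an illustration only, not the repository's implementation: the function names, the use of plain MSE (rather than the paper's exact loss), and the per-token whitening are all assumptions.

```python
import numpy as np

def whiten(feat, eps=1e-6):
    # Per-token whitening (a non-affine LayerNorm): zero mean and
    # unit variance across the channel dimension. This is the
    # "whitening operation" shared by MaskAlign and [1].
    mu = feat.mean(axis=-1, keepdims=True)
    var = feat.var(axis=-1, keepdims=True)
    return (feat - mu) / np.sqrt(var + eps)

def distill_loss(student_feat, teacher_feat):
    # With a 0% mask ratio every token is visible, so the student
    # simply regresses the whitened teacher features. Plain MSE is
    # used here for illustration.
    target = whiten(teacher_feat)
    return np.mean((student_feat - target) ** 2)

rng = np.random.default_rng(0)
student = rng.standard_normal((2, 196, 768))  # (batch, tokens, dim)
teacher = rng.standard_normal((2, 196, 768))
loss = distill_loss(student, teacher)
```

The whitening step keeps the target distribution well-conditioned, so the student is not forced to match the teacher's raw feature statistics.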
"Dynamic Alignment" actually acts as an adaption module to leverage multi-level features of the teacher model. So it naturally performs better than distillation only on the last layer's output.
Hope this helps; feel free to discuss further.
[1] Wei, Yixuan, et al. "Contrastive learning rivals masked image modeling in fine-tuning via feature distillation." arXiv preprint arXiv:2205.14141 (2022).
Thanks, it makes sense!
Hi authors, nice work! In Tab. 4, you conduct an ablation study of the mask ratio and masking strategy. The paper states that "For alignment on the visible features, even a 0% mask ratio gains improvement compared with the teacher model". I find this interesting but also have some questions. In my understanding, with a mask ratio of 0%, the setup degenerates into feature distillation plus your proposed "Dynamic Alignment". My questions:
Thanks in advance!