Hi, thanks for such great work! I was wondering which features are used for this loss: do we use intermediate features or the final encoder features?
Also, if the student and teacher feature dimensions are different, what kind of projection is used to bring them to a compatible feature space?
We use the same architecture for the student and the teacher (e.g., both ViT-Large) for alignment, so the feature dimensions match. If they differ, we recommend adding a linear projection layer on top of the student features to map them to the teacher's dimension.
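
In case it helps, here is a minimal PyTorch sketch of that suggestion. It is not taken from the repository: the class name `ProjectedAlignmentLoss`, the feature widths (384 and 1024), and the cosine-distance objective are all assumptions for illustration, and the actual alignment loss may be different.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ProjectedAlignmentLoss(nn.Module):
    """Hypothetical helper: maps student features to the teacher width, then compares them."""

    def __init__(self, student_dim: int, teacher_dim: int):
        super().__init__()
        # No projection is needed when the widths already match
        # (the same-architecture case, e.g. both ViT-Large).
        if student_dim == teacher_dim:
            self.proj = nn.Identity()
        else:
            self.proj = nn.Linear(student_dim, teacher_dim)

    def forward(self, student_feats: torch.Tensor, teacher_feats: torch.Tensor) -> torch.Tensor:
        # student_feats: (batch, tokens, student_dim)
        # teacher_feats: (batch, tokens, teacher_dim)
        projected = self.proj(student_feats)
        # Assumed objective: mean cosine distance per token. The teacher is
        # detached so gradients only reach the student and the projection.
        cos = F.cosine_similarity(projected, teacher_feats.detach(), dim=-1)
        return (1.0 - cos).mean()


torch.manual_seed(0)
student_feats = torch.randn(2, 16, 384)   # stand-in for student encoder output
teacher_feats = torch.randn(2, 16, 1024)  # stand-in for teacher encoder output

criterion = ProjectedAlignmentLoss(student_dim=384, teacher_dim=1024)
loss = criterion(student_feats, teacher_feats)
loss.backward()
```

The projection is trained jointly with the student and can be dropped at inference time, since only the student encoder is needed afterwards.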