Open tcourat opened 10 months ago
Hi,

I was reading the recent paper "Scalable Pre-training of Large Autoregressive Image Models" (Apple, https://arxiv.org/abs/2401.08541), and the authors observed that the best features for downstream tasks were not those from the last layer, but features from an intermediate layer:

I wonder whether the same behavior holds for DINOv2. In my specific case, I extract these features to compute similarity between patches of images.

It seems that the paper "Analyzing Local Representations of Self-supervised Vision Transformers" has made an extensive analysis of this question across different kinds of architectures (including DINOv2, the green curve).

So at least for DINOv2, using the features at the very end seems to be the right choice.
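In case it's useful, here is a minimal sketch of the patch-to-patch similarity computation, assuming you have already extracted per-patch features from a DINOv2 backbone (e.g. via its `get_intermediate_layers` method). The shapes, and the random arrays standing in for real features, are illustrative assumptions only:

```python
import numpy as np

def cosine_similarity_matrix(a, b, eps=1e-8):
    """Pairwise cosine similarity between two sets of patch features.

    a: [num_patches_a, dim], b: [num_patches_b, dim]
    returns: [num_patches_a, num_patches_b]
    """
    a = a / (np.linalg.norm(a, axis=1, keepdims=True) + eps)
    b = b / (np.linalg.norm(b, axis=1, keepdims=True) + eps)
    return a @ b.T

# Random stand-ins for DINOv2 patch features (hypothetical: 16x16 patch grid,
# 384-dim embeddings as in a ViT-S backbone).
rng = np.random.default_rng(0)
feats_a = rng.standard_normal((256, 384))
feats_b = rng.standard_normal((256, 384))

sim = cosine_similarity_matrix(feats_a, feats_b)

# For each patch in image A, the index of its most similar patch in image B:
best_match = sim.argmax(axis=1)
```

The same computation applies whichever layer the features come from, so it is easy to compare last-layer versus intermediate-layer features by swapping the inputs.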