facebookresearch / dinov2

PyTorch code and models for the DINOv2 self-supervised learning method.
Apache License 2.0

Best layer to extract features for embedding similarity computations #366

Open tcourat opened 10 months ago

tcourat commented 10 months ago

Hi,

I was reading the recent paper "Scalable Pre-training of Large Autoregressive Image Models" (Apple, https://arxiv.org/abs/2401.08541), where the authors observe that the best features for downstream tasks are not those from the last layer, but from some intermediate layer:

[Figure 10 from the AIM paper]

I wonder whether the same behavior holds for DINOv2. In my specific case, I extract these features to compute similarity between patches of images.

tcourat commented 8 months ago

It seems that the following paper, Analyzing Local Representations of Self-supervised Vision Transformers, has carried out an extensive analysis of this question across different kinds of architectures (including DINOv2, the green curve).

[Plot from the paper: layer-wise comparison across backbones, DINOv2 in green]

So at least for DINOv2, using the features from the very last layer seems to be the right choice.
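For reference, a minimal sketch of the patch-similarity step under that conclusion. In the DINOv2 repo, last-layer patch tokens can be obtained via `model.get_intermediate_layers(...)` on a model loaded through `torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14')`; here, placeholder arrays of the same shape stand in for the extracted tokens so the similarity computation itself is self-contained:

```python
import numpy as np

def patch_similarity(feats_a: np.ndarray, feats_b: np.ndarray) -> np.ndarray:
    """Cosine similarity between every patch of image A and every patch of image B.

    feats_*: (num_patches, dim) arrays of patch features, e.g. the last-layer
    patch tokens returned by DINOv2's get_intermediate_layers.
    Returns a (num_patches_a, num_patches_b) similarity matrix.
    """
    # L2-normalize each patch feature, then take dot products.
    a = feats_a / np.linalg.norm(feats_a, axis=1, keepdims=True)
    b = feats_b / np.linalg.norm(feats_b, axis=1, keepdims=True)
    return a @ b.T

# Placeholder features standing in for real DINOv2 patch tokens
# (e.g. ViT-S/14 on a 224x224 image -> 16x16 = 256 patches of dim 384).
rng = np.random.default_rng(0)
feats_a = rng.standard_normal((256, 384))
feats_b = rng.standard_normal((256, 384))

sim = patch_similarity(feats_a, feats_b)
print(sim.shape)  # (256, 256)
```

With the real model, `feats_a` would come from something like `model.get_intermediate_layers(img_tensor, n=1)[0].squeeze(0)`; the exact call signature is the one exposed in the repo's vision transformer, so check the current `hubconf.py` if it has changed.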