facebookresearch / dino

PyTorch code for training Vision Transformers with the self-supervised learning method DINO
Apache License 2.0

Metric for k-NN on ViT features? #62

Closed GerardMaggiolino closed 3 years ago

GerardMaggiolino commented 3 years ago

Looking through the k-NN code, it appears that the metric for feature distance is the dot product. The features are unnormalized, and the layer norm has element-wise affine parameters, so they're not necessarily z-scored either.

I'm wondering if I'm misinterpreting this, or if this was simply the best metric found by experimentation, since unlike SwAV the network isn't trained over an explicit metric space for the embeddings. On a medium-sized, aggregated, proprietary dataset, I'm finding that both Euclidean L2 and cosine similarity outperform the unnormalized dot product.
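(For context: if the features were L2-normalized, all three candidates would rank neighbors identically, since for unit vectors ||a − b||² = 2 − 2·aᵀb. A quick illustrative check, not repo code:)

```python
import torch
import torch.nn.functional as F

# On unit vectors, the dot product equals cosine similarity, and Euclidean
# distance is a monotone function of it (||a - b||^2 = 2 - 2 a.b),
# so all three metrics produce the same neighbor ranking.
feats = F.normalize(torch.randn(1000, 64), dim=1, p=2)
query = feats[0]
dot = feats @ query               # cosine similarity on unit vectors
l2 = (feats - query).norm(dim=1)  # Euclidean distance
assert torch.equal(dot.topk(10).indices, l2.topk(10, largest=False).indices)
```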

For copy detection, cosine distance is also used over augmented features with GeM-pooled patch tokens - is there a reason these features and the cosine metric weren't used for k-NN evaluation too?

mathildecaron31 commented 3 years ago

Hi @GerardMaggiolino

The features are normalized in the k-NN code: https://github.com/facebookresearch/dino/blob/4b96393c4c877d127cff9f077468e4a1cc2b5e2d/eval_knn.py#L69-L71

Therefore we use cosine similarity as the distance measure.
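A minimal sketch of that evaluation step (illustrative names and shapes; the actual eval_knn.py additionally applies temperature-weighted voting over the top k):

```python
import torch
import torch.nn as nn

# Sketch of normalize-then-dot-product k-NN classification. Not the exact
# eval_knn.py code, which weights neighbor votes by a softened similarity.
def knn_classify(train_features, train_labels, test_features, k=20):
    # L2-normalize so the dot product below is cosine similarity
    train_features = nn.functional.normalize(train_features, dim=1, p=2)
    test_features = nn.functional.normalize(test_features, dim=1, p=2)
    similarity = test_features @ train_features.T    # [n_test, n_train]
    _, indices = similarity.topk(k, dim=1)           # k nearest by cosine
    return train_labels[indices].mode(dim=1).values  # majority vote
```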

> For copy detection, cosine distance is also used over augmented features with GeM-pooled patch tokens - is there a reason these features and the cosine metric weren't used for k-NN evaluation too?

We wanted to keep the k-NN evaluation/feature extraction as simple as possible. It is definitely possible that concatenating GeM pooled patch token features would improve the performance for k-NN eval on ImageNet (I have not tried that).
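For anyone curious, GeM pooling over the patch tokens is just a generalized mean. A minimal sketch (assuming tokens shaped [batch, n_patches, dim] with the [CLS] token removed; p=1 reduces to average pooling):

```python
import torch

# GeM (generalized mean) pooling over the token dimension. The clamp keeps
# the fractional power well-defined, since ViT activations can be negative.
def gem_pool(tokens: torch.Tensor, p: float = 3.0, eps: float = 1e-6) -> torch.Tensor:
    return tokens.clamp(min=eps).pow(p).mean(dim=1).pow(1.0 / p)
```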

woctezuma commented 3 years ago

Mathilde answered regarding the metric (the features are in fact normalized before the dot product is computed).

About the features, I would like to add that there is another case, the linear evaluation, which differs from both the k-NN case and the copy-detection case. The linear classifier is trained with these features:

https://github.com/facebookresearch/dino/blob/4b96393c4c877d127cff9f077468e4a1cc2b5e2d/eval_linear.py#L141-L146
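In rough terms, that snippet concatenates the [CLS] token from the last n blocks, optionally together with the average-pooled patch tokens of the last block. A paraphrased sketch (not verbatim; `model`, `images`, `n_last_blocks`, and `avgpool_patchtokens` stand in for the script's own variables):

```python
import torch

# Paraphrased sketch of the linked eval_linear.py lines.
intermediate_output = model.get_intermediate_layers(images, n_last_blocks)
features = torch.cat([x[:, 0] for x in intermediate_output], dim=-1)  # [CLS] tokens
if avgpool_patchtokens:
    avg_patch = intermediate_output[-1][:, 1:].mean(dim=1)            # patch tokens
    features = torch.cat([features, avg_patch], dim=-1)
```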

I have played a bit with these features for image similarity in a hobby project, and I would love to see an "official" implementation of the feature computation used for the copy-detection case, so that I could directly plug it into my hobby project. This way, I could qualitatively compare the results obtained with my image dataset and be sure that there is no bug in the code.
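In the meantime, based on the paper's description ([CLS] token concatenated with GeM-pooled patch tokens, compared with cosine similarity), my guess at the computation would be something like this (hypothetical, not an official implementation):

```python
import torch
import torch.nn.functional as F

# Hypothetical copy-detection features, guessed from the paper's description:
# concatenate the [CLS] token with GeM-pooled patch tokens, then L2-normalize
# so the dot product gives cosine similarity. Not an official implementation.
def copy_detection_features(model, images, p=3.0, eps=1e-6):
    tokens = model.get_intermediate_layers(images, n=1)[0]  # [B, 1 + n_patches, D]
    cls_token = tokens[:, 0]                                # [B, D]
    gem = tokens[:, 1:].clamp(min=eps).pow(p).mean(dim=1).pow(1.0 / p)
    return F.normalize(torch.cat([cls_token, gem], dim=-1), dim=1, p=2)
```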


GerardMaggiolino commented 3 years ago

@mathildecaron31 Not sure how I missed that, thank you for pointing it out! All of your self-supervised papers have been awesome :)

@woctezuma I'm not sure what exponent GeM pooling is used with, and I think it might be across dim=2 in your code? It depends on what you interpret as the "feature maps" of a ViT. I'd expect it to be the output at each head, but given that information is mixed globally, there isn't really a direct comparison to CNN feature maps, which are spatially consistent. Either way, that looks good, assuming n=1.