facebookresearch / dinov2

PyTorch code and models for the DINOv2 self-supervised learning method.
Apache License 2.0

How to calculate cross-image similarity? #212

Open ZichengDuan opened 12 months ago

ZichengDuan commented 12 months ago

Hi, I see that a cross-entropy loss is introduced in the paper as the metric for comparing teacher and student outputs. I wonder how to extend DINOv2 to compute the similarity between two different images.

With such cross-entropy supervision, my understanding is that the model learns a general representation of objects within the same category. Can I then assume the network can measure the similarity between two different images by calculating the cross-entropy between their features? I tried computing the cross-entropy between different image pairs, but the results seem random.

And here's my code (please ignore my dummy coding style :) ):

import torch
from PIL import Image
import torchvision.transforms as T
import hubconf
from dinov2.models.vision_transformer import vit_large
import torch.nn.functional as F
from dinov2.data.transforms import IMAGENET_DEFAULT_MEAN, IMAGENET_DEFAULT_STD

def compute_cross_entropy_similarity(feature_array1, feature_array2, temp=0.1):
    if feature_array1.dim() == 1:  # in case the input CLS token has shape [1024]
        feature_array1 = feature_array1.unsqueeze(0)
        feature_array2 = feature_array2.unsqueeze(0)

    # Step 1: softmax over the feature dimension (epsilon avoids log(0))
    feature_array1 = F.softmax(feature_array1 / temp, dim=1) + 1e-15
    feature_array2 = F.softmax(feature_array2 / temp, dim=1) + 1e-15

    # Step 2: cross-entropy of array2 under array1, averaged over tokens
    entropy = -torch.sum(feature_array1 * torch.log(feature_array2), dim=1).mean()

    return entropy

if __name__ == "__main__":
    dino_checkpoint_path = "/home/zicheng/Projects/dinov2/pretrained_models/dinov2_vitl14_pretrain.pth"

    transform = T.Compose([
        T.Resize(256, interpolation=T.InterpolationMode.BICUBIC),
        T.CenterCrop(224),
        T.ToTensor(),
        T.Normalize(mean=IMAGENET_DEFAULT_MEAN, std=IMAGENET_DEFAULT_STD),
    ])

    # build ViT-L/14 to match the pretrained checkpoint (see hubconf.py)
    model = vit_large(patch_size=14, img_size=518, init_values=1.0, block_chunks=0)
    model.load_state_dict(torch.load(dino_checkpoint_path))
    for p in model.parameters():
        p.requires_grad = False
    # model = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitl14')  # no difference
    model.cuda()
    model.eval()

    img1 = Image.open('/home/zicheng/Projects/dinov2/Cat01.jpg')
    img2 = Image.open('/home/zicheng/Projects/dinov2/Cat02.jpg')

    # transform = T.Compose([
    #     T.Resize(224),
    #     T.CenterCrop(224),
    #     T.ToTensor(),
    #     T.Normalize(mean=[0.5], std=[0.5]),
    # ])

    img1 = transform(img1)[:3].unsqueeze(0)
    img2 = transform(img2)[:3].unsqueeze(0)

    with torch.no_grad():
        # one forward pass per image; reuse the output dict for both token types
        out1 = model.forward_features(img1.cuda())
        out2 = model.forward_features(img2.cuda())
        features1 = out1["x_norm_patchtokens"].detach().cpu().squeeze()
        features2 = out2["x_norm_patchtokens"].detach().cpu().squeeze()
        cls1 = out1["x_norm_clstoken"].detach().cpu().squeeze()
        cls2 = out2["x_norm_clstoken"].detach().cpu().squeeze()

    ce_patches_entropy = compute_cross_entropy_similarity(features1, features2) / 2 + compute_cross_entropy_similarity(features2, features1) / 2
    ce_cls_entropy = compute_cross_entropy_similarity(cls1, cls2) / 2 + compute_cross_entropy_similarity(cls2, cls1) / 2

    print(features1.shape, ce_patches_entropy, ce_cls_entropy)
Pseudo Input/Output:
CASE 1:
In: img1 = "Cat01.jpg"
In: img2 = "Cat01.jpg"
Out: ce_patches_entropy: 0.6807, ce_cls_entropy: 1.2190

CASE 2:
In: img1 = "Cat01.jpg"
In: img2 = "Dog01.jpg"
Out: ce_patches_entropy: 35, ce_cls_entropy: 29

CASE 3:
In: img1 = "Cat01.jpg"
In: img2 = "Cat02.jpg"
Out: ce_patches_entropy: 28, ce_cls_entropy: 34

Does anyone have any insights about these cases? Am I on the right track?
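
For context, the cross-entropy in the DINO/DINOv2 papers is computed between softmaxed outputs of the projection (DINO) head, i.e. prototype scores, not between raw backbone features. A minimal sketch of that loss shape (teacher centering omitted for brevity; all names here are illustrative, not taken from the repo):

import torch
import torch.nn.functional as F

def dino_cross_entropy(student_logits: torch.Tensor,
                       teacher_logits: torch.Tensor,
                       student_temp: float = 0.1,
                       teacher_temp: float = 0.04) -> torch.Tensor:
    # Teacher distribution: sharpened with a low temperature and detached
    # (centering of the teacher logits is omitted here for brevity).
    teacher_probs = F.softmax(teacher_logits / teacher_temp, dim=-1).detach()
    # Student distribution in log space for a numerically stable cross-entropy.
    student_log_probs = F.log_softmax(student_logits / student_temp, dim=-1)
    return -(teacher_probs * student_log_probs).sum(dim=-1).mean()

Since this loss operates on head outputs rather than backbone features, applying it directly to x_norm_clstoken or x_norm_patchtokens is not guaranteed to behave like a similarity measure.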

ZichengDuan commented 12 months ago

I also tried the dot product (cosine similarity) between the output class-token embeddings, and the similarity scores also look random. To rule out an implementation error, I tried three implementations (from knn.py, from PyTorch, and my own), and all of them consistently returned the same results:

Pseudo Input/Output:
CASE 1:
In: img1 = "Cat01.jpg"
In: img2 = "Cat01.jpg"
Out: cosine similarity: 1

CASE 2:
In: img1 = "Cat01.jpg"
In: img2 = "Dog01.jpg"
Out: cosine similarity: 0.0227

CASE 3:
In: img1 = "Cat01.jpg"
In: img2 = "Cat02.jpg"
Out: cosine similarity: -0.0247

Below are the images I used (attached): Cat01.jpg, Dog01.jpg
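
For reference, a minimal sketch of the [CLS]-token cosine comparison (assuming the ImageNet-normalized transform above; F.cosine_similarity L2-normalizes internally, so no manual normalization is needed):

import torch
import torch.nn.functional as F

def cls_cosine_similarity(model, img1: torch.Tensor, img2: torch.Tensor) -> float:
    # img1 / img2: preprocessed [1, 3, H, W] tensors
    with torch.no_grad():
        cls1 = model.forward_features(img1.cuda())["x_norm_clstoken"]
        cls2 = model.forward_features(img2.cuda())["x_norm_clstoken"]
    return F.cosine_similarity(cls1, cls2, dim=-1).item()

With correct preprocessing, two cat images would be expected to score noticeably higher than a cat/dog pair.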

WANGSSSSSSS commented 12 months ago

https://github.com/facebookresearch/dinov2/blob/main/dinov2/data/transforms.py#L42 I guess your transform is overridden by the second one, which uses the wrong mean and std values:


transform = T.Compose([
    T.Resize(224),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.5], std=[0.5]),
])

img1 = transform(img1)[:3].unsqueeze(0)
img2 = transform(img2)[:3].unsqueeze(0)
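
For comparison, the evaluation preprocessing in the linked transforms.py uses the standard ImageNet statistics; a minimal equivalent sketch:

import torchvision.transforms as T

IMAGENET_DEFAULT_MEAN = (0.485, 0.456, 0.406)
IMAGENET_DEFAULT_STD = (0.229, 0.224, 0.225)

transform = T.Compose([
    T.Resize(256, interpolation=T.InterpolationMode.BICUBIC),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=IMAGENET_DEFAULT_MEAN, std=IMAGENET_DEFAULT_STD),
])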
ZichengDuan commented 12 months ago

https://github.com/facebookresearch/dinov2/blob/main/dinov2/data/transforms.py#L42 I guess your transform is overridden by the second one, which uses the wrong mean and std values:

transform = T.Compose([
    T.Resize(224),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.5], std=[0.5]),
])

img1 = transform(img1)[:3].unsqueeze(0)
img2 = transform(img2)[:3].unsqueeze(0)

Sorry for the confusion: I had deleted this transform in my own code and simply forgot to remove it here. Thanks for pointing it out; I have edited the code above. :)

qasfb commented 12 months ago

Did you get a chance to try a simple L2 distance between the [CLS] tokens? That would be close to what we use with the k-NN classifier.
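
A minimal sketch of that comparison, assuming cls1 and cls2 are the x_norm_clstoken outputs from the code above:

import torch

def cls_l2_distance(cls1: torch.Tensor, cls2: torch.Tensor) -> float:
    # Lower distance means more similar; this mirrors the distance
    # typically used for k-NN evaluation on frozen features.
    return torch.linalg.vector_norm(cls1 - cls2, ord=2, dim=-1).item()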