Hi, I'm opening this topic because I am a bit lost about which features are best to extract for comparing images (looking for similar images, independent of viewpoint).
import timm
import torch

def load_dino_vit_model(weights_path):
    # Load a pre-trained DINO ViT backbone via timm
    # Specify the appropriate model name and path as needed
    model_name = 'vit_small_patch16_224'  # example model name, adjust based on actual use
    model = timm.create_model(model_name, pretrained=False, num_classes=0)  # num_classes=0 for feature extraction
    checkpoint = torch.load(weights_path, map_location='cpu')
    # Extract the 'teacher' state dictionary and remove the 'backbone.' prefix from each key
    state_dict = checkpoint['teacher']
    adapted_state_dict = {key.replace('backbone.', ''): value for key, value in state_dict.items()}
    model.load_state_dict(adapted_state_dict, strict=False)
    # model = torch.hub.load('facebookresearch/dino:main', 'dino_vitb8')
    model.eval()  # set the model to evaluation mode
    if torch.cuda.is_available():
        model.cuda()
    return model
I am loading this model and then just doing

output = model(image)

This returns a 384- (or 768-)dimensional feature. Is this feature the class token's activations, or does it come from somewhere else?
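To make the question concrete, here is a toy sketch of the token layout I have in mind (assuming timm's ViT ordering of [CLS, patch_1, ..., patch_N], with random tensors standing in for the real encoder output):

```python
import torch

# Simulated encoder output: batch of 2 images,
# 1 class token + 196 patch tokens (14x14 grid), embed dim 384.
tokens = torch.randn(2, 197, 384)

cls_token = tokens[:, 0]      # presumably what model(image) returns with num_classes=0
patch_tokens = tokens[:, 1:]  # spatially arranged patch embeddings

# A permutation-invariant alternative: average-pool the patch tokens,
# which discards the spatial ordering of the patches.
pooled = patch_tokens.mean(dim=1)  # shape (2, 384), same dim as the class token
```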
If it is the class token, I think that would not be ideal, since it contains positional information, which is not the best for comparing images from different viewpoints.

Also, I see that I am not using the teacher's MLP projection head, which is used during training and outputs a 60k+-dimensional vector for comparison against the student branch.

So, if I wanted an image feature (with pseudo-semantic info, not positional) on the order of 2-3k dimensions, where would be the best place to get it from?
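One recipe I have seen for getting into that dimensionality range is to concatenate the class token from the last few transformer blocks (e.g. 4 x 768 = 3072 dims for ViT-B). Below is a minimal sketch of the forward-hook mechanics, using a toy stack of identity "blocks" in place of the real model; on a real timm ViT the same hooks would presumably go on model.blocks[-4:]:

```python
import torch
import torch.nn as nn

embed_dim, n_tokens = 768, 197
# Toy stand-in for a ViT encoder: a stack of identity blocks over the token sequence.
blocks = nn.ModuleList([nn.Identity() for _ in range(12)])

captured = []
def grab_cls(module, inputs, output):
    # output: (B, n_tokens, embed_dim); keep only the class token
    captured.append(output[:, 0])

# Hook the last 4 blocks so each one's class token is recorded during the forward pass.
handles = [blk.register_forward_hook(grab_cls) for blk in blocks[-4:]]

x = torch.randn(2, n_tokens, embed_dim)
with torch.no_grad():
    for blk in blocks:
        x = blk(x)

# Concatenate the 4 captured class tokens into one descriptor: (2, 4 * 768) = (2, 3072)
descriptor = torch.cat(captured, dim=-1)

for h in handles:
    h.remove()  # clean up the hooks
```

This keeps the descriptor free of the per-patch spatial layout (only class tokens are used), at the cost of stacking representations from several depths.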
Thanks