Hi, I'm opening this topic because I am a bit lost about which features are best to extract for comparing images (looking for similar images, independent of viewpoint).
import timm
import torch

def load_dino_vit_model(weights_path):
    # Load a pre-trained DINO ViT backbone via timm
    # Specify the appropriate model name and path as needed
    model_name = 'vit_small_patch16_224'  # example model name, adjust based on actual use
    model = timm.create_model(model_name, pretrained=False, num_classes=0)  # num_classes=0 for feature extraction
    checkpoint = torch.load(weights_path, map_location='cpu')
    # Extract the 'teacher' state dictionary and remove the 'backbone.' prefix from each key
    state_dict = checkpoint['teacher']
    adapted_state_dict = {key.replace('backbone.', ''): value for key, value in state_dict.items()}
    model.load_state_dict(adapted_state_dict, strict=False)
    # model = torch.hub.load('facebookresearch/dino:main', 'dino_vitb8')
    model.eval()  # set the model to evaluation mode
    if torch.cuda.is_available():
        model.cuda()
    return model
I am loading this model and then just doing

output = model(image)

This returns a 384- (or 768-)dimensional feature. Is this feature the class token's activations, or does it come from somewhere else?
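To make the question concrete, here is a toy sketch of the token layout I have in mind (assuming timm's ViT ordering of [CLS, patch_1, ..., patch_N], with random tensors standing in for the real encoder output):

```python
import torch

# Simulated encoder output: batch of 2 images,
# 1 class token + 196 patch tokens (14x14 grid), embed dim 384.
tokens = torch.randn(2, 197, 384)

cls_token = tokens[:, 0]      # presumably what model(image) returns with num_classes=0
patch_tokens = tokens[:, 1:]  # spatially arranged patch embeddings

# A permutation-invariant alternative: average-pool the patch tokens,
# which discards the spatial ordering of the patches.
pooled = patch_tokens.mean(dim=1)  # shape (2, 384), same dim as the class token
```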
If it is the class token, I think that would not be ideal, since it contains positional information, which is not the best for comparing images from different viewpoints.

Also, I see that I am not using the teacher's MLP projection head, which is used during training and outputs a 60k+-dimensional vector for comparison against the student branch.

So, if I wanted an image feature (with pseudo-semantic info, not positional) on the order of 2-3k dimensions, where would be the best place to get it from?
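One recipe I have seen for getting into that dimensionality range is to concatenate the class token from the last few transformer blocks (e.g. 4 x 768 = 3072 dims for ViT-B). Below is a minimal sketch of the forward-hook mechanics, using a toy stack of identity "blocks" in place of the real model; on a real timm ViT the same hooks would presumably go on model.blocks[-4:]:

```python
import torch
import torch.nn as nn

embed_dim, n_tokens = 768, 197
# Toy stand-in for a ViT encoder: a stack of identity blocks over the token sequence.
blocks = nn.ModuleList([nn.Identity() for _ in range(12)])

captured = []
def grab_cls(module, inputs, output):
    # output: (B, n_tokens, embed_dim); keep only the class token
    captured.append(output[:, 0])

# Hook the last 4 blocks so each one's class token is recorded during the forward pass.
handles = [blk.register_forward_hook(grab_cls) for blk in blocks[-4:]]

x = torch.randn(2, n_tokens, embed_dim)
with torch.no_grad():
    for blk in blocks:
        x = blk(x)

# Concatenate the 4 captured class tokens into one descriptor: (2, 4 * 768) = (2, 3072)
descriptor = torch.cat(captured, dim=-1)

for h in handles:
    h.remove()  # clean up the hooks
```

This keeps the descriptor free of the per-patch spatial layout (only class tokens are used), at the cost of stacking representations from several depths.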
Thanks