facebookresearch / dinov2

PyTorch code and models for the DINOv2 self-supervised learning method.
Apache License 2.0

Do feature vectors for DINOv2 include small objects? #397

Open smandava98 opened 6 months ago

smandava98 commented 6 months ago

Hi,

When I visualize the features via PCA, I'm able to see small objects, but I'm not sure whether this means the 1024-dimensional feature vector of ViT-L from DINOv2 must include spatial information about small objects relative to larger objects in the image?

Also, how can I properly reason about when to use the patch tokens vs. the final embedding vector that the model returns if I am trying to use it to build a video object detection model, which would predict accurate bounding boxes over frames?

Currently, I just use that final 1024-dimensional vector, but I'm not sure if I should use the patch tokens instead, as that would be a lot of data if I am operating on video.

nourihilscher commented 6 months ago

As far as I can tell, I would recommend using the patch tokens for your case. The final class token characterizes the image as a complete entity, giving you the ability to compare the overall content of two images with each other. The patch tokens characterize the content of each 14x14 image patch. Remember that you can downscale the original images to reduce the overall number of patch tokens (each side length has to be a multiple of 14, of course). Downscaling reduces the quality of the image, which is bad for segmentation models, but in your case, as you are only interested in bounding boxes, this should be fine.
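A minimal sketch of both token types, assuming torch.hub access to the released dinov2_vitl14 model (the dict keys are what forward_features returns in this repo; the resizing shown is just one way to satisfy the multiple-of-14 constraint):

import torch
import torch.nn.functional as F

model = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitl14').eval()

img = torch.rand(1, 3, 480, 640)  # placeholder batch (B, C, H, W)

# Round each spatial dimension down to the nearest multiple of the 14-pixel patch size.
patch_size = 14
h = (img.shape[-2] // patch_size) * patch_size
w = (img.shape[-1] // patch_size) * patch_size
img = F.interpolate(img, size=(h, w), mode='bilinear')

with torch.no_grad():
    out = model.forward_features(img)

cls_token = out['x_norm_clstoken']        # (B, 1024): one summary vector per image
patch_tokens = out['x_norm_patchtokens']  # (B, (h//14)*(w//14), 1024): one vector per patch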

I am actually curious how you visualized the final class token using PCA. How did you retrieve an image back from just the class token?

smandava98 commented 6 months ago

Oh I used the patch tokens for PCA, not the class token.

import torch
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

with torch.no_grad():
    features_dict = dinov2_vitl14.forward_features(imgs_tensor_orig.cuda())
    features = features_dict['x_norm_patchtokens']

batch_size = features.shape[0]
num_patches = features.shape[1]
# Flatten all patch tokens across the batch; ViT-L features are 1024-d (ViT-G would be 1536-d)
features2 = features.reshape(batch_size * num_patches, 1024).cpu()

pca = PCA(n_components=1)
pca.fit(features2)
pca_features1 = pca.transform(features2)

# Visualize the first PCA component per image (my patch grid is 91x52)
for i in range(batch_size):
    plt.subplot(1, batch_size, i + 1)
    plt.imshow(pca_features1[i * num_patches: (i + 1) * num_patches, 0].reshape(91, 52))
plt.show()

Is there any benefit to prepending the CLS token to the patch tokens before passing them into my model? Or would just the patch tokens suffice?

nourihilscher commented 6 months ago

I think the patch tokens should suffice, but I haven't tried including the class token. If you prepend it to your patch embeddings by concatenation (see the sketch below), I would assume that the eigenvectors PCA projects onto should not change much, since this single additional token contributes very little variance.
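For concreteness, a minimal sketch of what that prepending would look like, reusing the features_dict and features names from your snippet above:

cls_token = features_dict['x_norm_clstoken'].unsqueeze(1)  # (B, 1, 1024)
tokens = torch.cat([cls_token, features], dim=1)           # (B, 1 + num_patches, 1024)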

(By the way, in addition to downscaling the image, you probably also don't need to process every frame of your video. Maybe every k-th frame is enough if you smoothly interpolate the positions of your bounding boxes between the processed frames. A sketch of that idea follows.)
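A minimal sketch of that frame-skipping idea, assuming axis-aligned (x1, y1, x2, y2) boxes and simple linear interpolation; the box values are made up for illustration:

import numpy as np

def interpolate_boxes(box_a, box_b, num_between):
    # Linearly interpolate the (x1, y1, x2, y2) corners for the skipped frames.
    box_a = np.asarray(box_a, dtype=float)
    box_b = np.asarray(box_b, dtype=float)
    ts = np.linspace(0.0, 1.0, num_between + 2)[1:-1]
    return [(1 - t) * box_a + t * box_b for t in ts]

# Example: the detector ran on frames 0 and 5; fill in frames 1-4.
boxes_1_to_4 = interpolate_boxes((10, 20, 50, 80), (20, 30, 60, 90), num_between=4)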

LiAo365 commented 4 months ago

I have the same question, too. Judging from the provided segmentation notebook, if the 1024-d image representations can be used for segmentation, it should not be a problem to use them for object detection. However, I find that other works like Grounding-DINO and Video Grounding-DINO do this by obtaining multi-scale features from the image backbone for each frame, and I am also curious whether the 1024-d feature representation is suitable for video object detection.
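If multi-level features are what you need, DINOv2's ViT exposes intermediate block outputs via get_intermediate_layers, though all of them share the same 1/14 spatial resolution, so this is not truly multi-scale like an FPN. A minimal sketch, reusing the dinov2_vitl14 and imgs_tensor_orig names from the earlier snippet:

with torch.no_grad():
    # Tuple of 4 feature maps from the last 4 transformer blocks,
    # each reshaped to (B, 1024, H/14, W/14).
    multi_level = dinov2_vitl14.get_intermediate_layers(
        imgs_tensor_orig.cuda(), n=4, reshape=True
    )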

LiAo365 commented 4 months ago

@smandava98 May I ask how effective your results were when you just used the 1024-dimensional features? Thanks!!!