facebookresearch / dinov2

PyTorch code and models for the DINOv2 self-supervised learning method.
Apache License 2.0

Could you tell me how to visualize PCA 3-channel image just like the showcase in your README #23

Closed shoutOutYangJie closed 1 year ago

shoutOutYangJie commented 1 year ago

If you could share the visualization code, it would be a great contribution.

woctezuma commented 1 year ago

Loosely related:

shoutOutYangJie commented 1 year ago

Loosely related:

The quality is very different.

woctezuma commented 1 year ago

That is because the user did not keep the first 3 components, etc. See the caption of the first figure in the article.

shoutOutYangJie commented 1 year ago

I don't know what "threshold the first component and execute a second PCA" means. Please help me, @woctezuma.

shoutOutYangJie commented 1 year ago

I tried my best, but I can't remove the background.

image image

woctezuma commented 1 year ago

I think you need to find a threshold for the first component. From what I understand, this removes the background. I have not checked how the threshold is computed for the figure in the paper; I guess it is done automatically.

Then you can use the first three components to visualize the areas in RGB.

I have not tested it, but that is my understanding of the figure.
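
A rough, untested sketch of what I mean, assuming features is an (n_patches, feat_dim) array of patch features (the random array and the threshold value below are placeholders):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

features = np.random.randn(1600, 1024)  # stand-in for real patch features

# First PCA component over all patches.
first_component = PCA(n_components=1).fit_transform(features)[:, 0]

# Inspect the distribution to pick a threshold that separates the background.
plt.hist(first_component, bins=50)
plt.show()

threshold = 0.0  # placeholder value; choose it from the histogram
foreground = first_component > threshold  # background patches are ~foreground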

shoutOutYangJie commented 1 year ago

I think you need to find a threshold for the first component. From what I understand, this removes the background. I have not checked how the threshold is computed for the figure in the paper; I guess it is done automatically.

Then you can use the first three components to visualize the areas in RGB.

I have not tested it, but that is my understanding of the figure.

If I want the first 3 components, why do I need to find a threshold for the first component? I am very confused: one or three? If I remove the other components whose eigenvalues are less than the first component's, there is only one component left, not three.

ccharest93 commented 1 year ago

The background is one of the major components (often part of the first one); that is why the threshold needs to be applied there. Screenshot 2023-04-19 143603

woctezuma commented 1 year ago

With the thresholded first component, you would be able to tell the foreground from the background (in black). Then you would get the colors with the first three components. That is what I understand; I have not tested it myself.

Paper

shoutOutYangJie commented 1 year ago

Let us discuss the details!
The token features extracted from the network are X (shape 256x1536, where 256 comes from 16x16 patches). First, we apply PCA:

eigen_values, eigen_vectors = eig(X.T.dot(X))

The eigenvector matrix has shape 1536x1536. Second, I sort the eigenvalues to remove the first component, filter out the values that are less than 0, and execute a second PCA, directly reducing the dimension to 3 (the RGB channels).

Is there something wrong?

@woctezuma @ccharest93

ccharest93 commented 1 year ago

I haven't gotten to this point yet, but let me see if I can help. Assuming you have 16x16 patches and an embedding dimension of 1536 for each patch, your output shape will be [256, 1536].

Then do PCA,

with the resulting matrix of 1536x1536 forming a basis for our embedding dimension. Then sort the eigenvectors by eigenvalue and get a decomposition of your original 256x1536 embedding matrix in terms of this basis.

This is where you have to figure out which components you want to keep and threshold. You could try: remove the basis vector with the highest eigenvalue, and take the next 3 basis vectors with the highest eigenvalues as your RGB channels. You then have an RGB color for each of your patches, and you interpolate?

I'll be able to be of more help once I get to that myself.
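
For reference, a minimal sketch of that eigendecomposition-based projection (the random X below stands in for real patch features; this is just one way to compute the components, not necessarily what the paper does):

import numpy as np

X = np.random.randn(256, 1536)  # stand-in for the 256 patch features

# Classic PCA: center the features, then eigendecompose the covariance matrix.
Xc = X - X.mean(axis=0)
eigen_values, eigen_vectors = np.linalg.eigh(Xc.T @ Xc)  # eigh: symmetric matrix

# Sort the eigenvectors by decreasing eigenvalue.
order = np.argsort(eigen_values)[::-1]
eigen_vectors = eigen_vectors[:, order]

# Project each patch onto the leading components.
first_component = Xc @ eigen_vectors[:, 0]  # candidate for fg/bg thresholding
next_three = Xc @ eigen_vectors[:, 1:4]     # candidate RGB channels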

WangYixuan12 commented 1 year ago

I followed the description from the paper and tested the example elephant images from the paper. The following code can segment out the foreground. It is kinda able to learn features similar to Fig. 1 from the paper, but not exactly. Feel free to play around with it! I am looking forward to more discussion!

import torch
import torch.nn.functional as F
import torchvision.transforms as T
import os
import cv2
import matplotlib.pyplot as plt
import numpy as np
from PIL import Image

patch_h = 40
patch_w = 40
# feat_dim = 384 # vits14
# feat_dim = 768 # vitb14
# feat_dim = 1024 # vitl14
feat_dim = 1536 # vitg14

transform = T.Compose([
    T.GaussianBlur(9, sigma=(0.1, 2.0)),
    T.Resize((patch_h * 14, patch_w * 14)),
    T.CenterCrop((patch_h * 14, patch_w * 14)),
    T.ToTensor(),
    T.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])

# dinov2_vits14 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14')
# dinov2_vitb14 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitb14')
# dinov2_vitl14 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitl14')
dinov2_vitg14 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitg14')

print(dinov2_vitg14)

# extract features
features = torch.zeros(4, patch_h * patch_w, feat_dim)
imgs_tensor = torch.zeros(4, 3, patch_h * 14, patch_w * 14)
for i in range(4):
    img_path = f'dino_test_imgs/elephant_{i+1}.png'
    img = Image.open(img_path).convert('RGB')
    imgs_tensor[i] = transform(img)[:3]
with torch.no_grad():
    features_dict = dinov2_vitg14.forward_features(imgs_tensor)
    features = features_dict['x_norm_patchtokens']

# PCA for feature inferred
from sklearn.decomposition import PCA

features = features.reshape(4 * patch_h * patch_w, feat_dim)

pca = PCA(n_components=3)
pca.fit(features)
pca_features = pca.transform(features)

# visualize PCA components for finding a proper threshold
plt.subplot(1, 3, 1)
plt.hist(pca_features[:, 0])
plt.subplot(1, 3, 2)
plt.hist(pca_features[:, 1])
plt.subplot(1, 3, 3)
plt.hist(pca_features[:, 2])
plt.show()
plt.close()

# uncomment below to plot the first pca component
# pca_features[:, 0] = (pca_features[:, 0] - pca_features[:, 0].min()) / (pca_features[:, 0].max() - pca_features[:, 0].min())
# for i in range(4):
#     plt.subplot(2, 2, i+1)
#     plt.imshow(pca_features[i * patch_h * patch_w: (i+1) * patch_h * patch_w, 0].reshape(patch_h, patch_w))
# plt.show()
# plt.close()

# segment using the first component
pca_features_bg = pca_features[:, 0] < 10
pca_features_fg = ~pca_features_bg

# plot the pca_features_bg
for i in range(4):
    plt.subplot(2, 2, i+1)
    plt.imshow(pca_features_bg[i * patch_h * patch_w: (i+1) * patch_h * patch_w].reshape(patch_h, patch_w))
plt.show()

# PCA for only foreground patches
pca.fit(features[pca_features_fg]) # NOTE: I forgot to add it in my original answer
pca_features_rem = pca.transform(features[pca_features_fg])
for i in range(3):
    # pca_features_rem[:, i] = (pca_features_rem[:, i] - pca_features_rem[:, i].min()) / (pca_features_rem[:, i].max() - pca_features_rem[:, i].min())
    # transform using mean and std, I personally found this transformation gives a better visualization
    pca_features_rem[:, i] = (pca_features_rem[:, i] - pca_features_rem[:, i].mean()) / (pca_features_rem[:, i].std() ** 2) + 0.5

pca_features_rgb = pca_features.copy()
pca_features_rgb[pca_features_bg] = 0
pca_features_rgb[pca_features_fg] = pca_features_rem

pca_features_rgb = pca_features_rgb.reshape(4, patch_h, patch_w, 3)
for i in range(4):
    plt.subplot(2, 2, i+1)
    plt.imshow(pca_features_rgb[i][..., ::-1])
plt.savefig('features.png')
plt.show()
plt.close()
shoutOutYangJie commented 1 year ago

I haven't gotten to this point yet, but let me see if I can help. Assuming you have 16x16 patches and an embedding dimension of 1536 for each patch, your output shape will be [256, 1536].

Then do PCA,

with the resulting matrix of 1536x1536 forming a basis for our embedding dimension. Then sort the eigenvectors by eigenvalue and get a decomposition of your original 256x1536 embedding matrix in terms of this basis.

This is where you have to figure out which components you want to keep and threshold. You could try: remove the basis vector with the highest eigenvalue, and take the next 3 basis vectors with the highest eigenvalues as your RGB channels. You then have an RGB color for each of your patches, and you interpolate?

I'll be able to be of more help once I get to that myself.

So what is the "second PCA" referred to in the paper?

shoutOutYangJie commented 1 year ago

With the thresholded first component, you would be able to tell the foreground from the background (in black). Then you would get the colors with the first three components. That is what I understand; I have not tested it myself.

Paper

Wow, how did you get this result?

shoutOutYangJie commented 1 year ago

I followed the description from the paper and tested the example elephant images from the paper. The following code can segment out the foreground. It is kinda able to learn features similar to Fig. 1 from the paper, but not exactly. Feel free to play around with it! I am looking forward to more discussion!

import torch
import torch.nn.functional as F
import torchvision.transforms as T
import os
import cv2
import matplotlib.pyplot as plt
import numpy as np
from PIL import Image

patch_h = 40
patch_w = 40
# feat_dim = 384 # vits14
# feat_dim = 768 # vitb14
# feat_dim = 1024 # vitl14
feat_dim = 1536 # vitg14

transform = T.Compose([
    T.GaussianBlur(9, sigma=(0.1, 2.0)),
    T.Resize((patch_h * 14, patch_w * 14)),
    T.CenterCrop((patch_h * 14, patch_w * 14)),
    T.ToTensor(),
    T.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])

# dinov2_vits14 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14')
# dinov2_vitb14 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitb14')
# dinov2_vitl14 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitl14')
dinov2_vitg14 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitg14')

print(dinov2_vitg14)

# extract features
features = torch.zeros(4, patch_h * patch_w, feat_dim)
imgs_tensor = torch.zeros(4, 3, patch_h * 14, patch_w * 14)
for i in range(4):
    img_path = f'dino_test_imgs/elephant_{i+1}.png'
    img = Image.open(img_path).convert('RGB')
    imgs_tensor[i] = transform(img)[:3]
with torch.no_grad():
    features_dict = dinov2_vitg14.forward_features(imgs_tensor)
    features = features_dict['x_norm_patchtokens']

# PCA for feature inferred
from sklearn.decomposition import PCA

features = features.reshape(4 * patch_h * patch_w, feat_dim)

pca = PCA(n_components=3)
pca.fit(features)
pca_features = pca.transform(features)

# visualize PCA components for finding a proper threshold
plt.subplot(1, 3, 1)
plt.hist(pca_features[:, 0])
plt.subplot(1, 3, 2)
plt.hist(pca_features[:, 1])
plt.subplot(1, 3, 3)
plt.hist(pca_features[:, 2])
plt.show()
plt.close()

# uncomment below to plot the first pca component
# pca_features[:, 0] = (pca_features[:, 0] - pca_features[:, 0].min()) / (pca_features[:, 0].max() - pca_features[:, 0].min())
# for i in range(4):
#     plt.subplot(2, 2, i+1)
#     plt.imshow(pca_features[i * patch_h * patch_w: (i+1) * patch_h * patch_w, 0].reshape(patch_h, patch_w))
# plt.show()
# plt.close()

# segment using the first component
pca_features_bg = pca_features[:, 0] < 10
pca_features_fg = ~pca_features_bg

# plot the pca_features_bg
for i in range(4):
    plt.subplot(2, 2, i+1)
    plt.imshow(pca_features_bg[i * patch_h * patch_w: (i+1) * patch_h * patch_w].reshape(patch_h, patch_w))
plt.show()

# PCA for only foreground patches
pca_features_rem = pca.transform(features[pca_features_fg])
for i in range(3):
    # pca_features_rem[:, i] = (pca_features_rem[:, i] - pca_features_rem[:, i].min()) / (pca_features_rem[:, i].max() - pca_features_rem[:, i].min())
    # transform using mean and std, I personally found this transformation gives a better visualization
    pca_features_rem[:, i] = (pca_features_rem[:, i] - pca_features_rem[:, i].mean()) / (pca_features_rem[:, i].std() ** 2) + 0.5

pca_features_rgb = pca_features.copy()
pca_features_rgb[pca_features_bg] = 0
pca_features_rgb[pca_features_fg] = pca_features_rem

pca_features_rgb = pca_features_rgb.reshape(4, patch_h, patch_w, 3)
for i in range(4):
    plt.subplot(2, 2, i+1)
    plt.imshow(pca_features_rgb[i][..., ::-1])
plt.savefig('features.png')
plt.show()
plt.close()

I have used your code, and I find the result still shows a little of the background image.

image

ccharest93 commented 1 year ago

Oh, it seems the first PCA is to mask the background; then for the second PCA you need a batch of 3 related images (so 3 cats?), and you do the PCA across the 3 images to get a good signal for the components. I'm assuming the 3 images have to be fairly different while having similar object parts in them, so that PCA can pick up on them.

A good paper on co-segmentation using DINOv1, with even better results: Deep ViT Features as Dense Visual Descriptors.

image
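
A minimal sketch of that cross-image second PCA (untested; the random features, the number of patches, and the threshold are placeholders):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import minmax_scale

n_images, n_patches, feat_dim = 3, 1600, 1024
features = np.random.randn(n_images, n_patches, feat_dim)  # stand-in patch features

# First PCA across all patches of all images: threshold component 0 to mask the background.
flat = features.reshape(-1, feat_dim)
first = PCA(n_components=1).fit_transform(flat)[:, 0]
fg_mask = first > first.mean()  # placeholder threshold

# Second PCA fitted only on the pooled foreground patches of the related images,
# so the same components (colors) line up across images.
rgb = np.zeros((flat.shape[0], 3))
rgb[fg_mask] = minmax_scale(PCA(n_components=3).fit_transform(flat[fg_mask]))
rgb = rgb.reshape(n_images, n_patches, 3)  # one RGB value per patch, per image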

woctezuma commented 1 year ago

Wow, I overlooked this part! 👏

We compute a PCA between the patches of the images from the same column (a, b, c and d) and show their first 3 components. Each component is matched to a different color channel.

Column

WangYixuan12 commented 1 year ago

I haven't gotten to this point yet, but let me see if I can help. Assuming you have 16x16 patches and an embedding dimension of 1536 for each patch, your output shape will be [256, 1536]. Then do PCA, with the resulting matrix of 1536x1536 forming a basis for our embedding dimension. Then sort the eigenvectors by eigenvalue and get a decomposition of your original 256x1536 embedding matrix in terms of this basis. This is where you have to figure out which components you want to keep and threshold. You could try: remove the basis vector with the highest eigenvalue, and take the next 3 basis vectors with the highest eigenvalues as your RGB channels. You then have an RGB color for each of your patches, and you interpolate? I'll be able to be of more help once I get to that myself.

So what is the "second PCA" referred to in the paper?

As mentioned on p. 16 of the paper, "We compute a second PCA on the remaining patches across three images depicting the same category." I think that means we need to do a second PCA on the foreground patches.

ccharest93 commented 1 year ago

I still can't seem to get a good signal for thresholding the background with only the first component. With batches of images it is easier, but even though the first component seems to give a good indication of depth, it doesn't seem to have a threshold that consistently separates the foreground from the background as shown in the paper.

C1: the original image; C2: first PCA rescaled to grayscale; C3: first PCA applied to groups of heads, rescaled to RGB (R -> heads 1-8, G -> heads 9-16, B -> heads 17-24).

Screenshot 2023-04-21 112155

I've tried other methods as well but can't seem to get a consistent result. Has anyone made any progress here? This is only foreground/background separation; I haven't gotten to the second PCA yet for image part correspondence.

TheoMoutakanni commented 1 year ago

I still can't seem to get a good signal for thresholding the background with only the first component. With batches of images it is easier, but even though the first component seems to give a good indication of depth, it doesn't seem to have a threshold that consistently separates the foreground from the background as shown in the paper.

C1: the original image; C2: first PCA rescaled to grayscale; C3: first PCA applied to groups of heads, rescaled to RGB (R -> heads 1-8, G -> heads 9-16, B -> heads 17-24).

Screenshot 2023-04-21 112155

I've tried other methods as well but can't seem to get a consistent result. Has anyone made any progress here? This is only foreground/background separation; I haven't gotten to the second PCA yet for image part correspondence.

@ccharest93 You should use a clean image with a clear background/foreground separation

See my comment on this related issue: #45

ccharest93 commented 1 year ago

Progress! pil_image vit_large_desc

WangYixuan12 commented 1 year ago

Progress! pil_image vit_large_desc

Could you please share more details on how you obtained this result?

ccharest93 commented 1 year ago

Using the large model gave me better results than the giant model. I followed the procedure from the paper (2-step PCA), and I grab the outputs before the last normalization.

transforms.Compose([transforms.Resize(518, interpolation=transforms.InterpolationMode.LANCZOS),
 transforms.CenterCrop(518),
 transforms.ToTensor(),
 transforms.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225))])

is my input transformation. It's possible I changed other things in the reimplementation on my GitHub, but I don't think so. As for why vit_large works better than giant on noisy backgrounds, I think it might have to do with how sparse the embeddings of the main object parts end up being compared to the background; I'm still doing analysis on that.
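
For what it's worth, a minimal, self-contained sketch of grabbing the tokens before the last normalization with that transform (the image path is a placeholder; the forward_features keys are the ones discussed in this thread):

import torch
from PIL import Image
from torchvision import transforms

transform = transforms.Compose([transforms.Resize(518, interpolation=transforms.InterpolationMode.LANCZOS),
                                transforms.CenterCrop(518),
                                transforms.ToTensor(),
                                transforms.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225))])

model = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitl14').eval()
img = Image.open('img.png').convert('RGB')  # placeholder path

with torch.no_grad():
    out = model.forward_features(transform(img).unsqueeze(0))

# Patch tokens before the final LayerNorm (drop the CLS token at index 0),
# versus the normalized patch tokens.
prenorm_patches = out['x_prenorm'][:, 1:, :]
norm_patches = out['x_norm_patchtokens']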

woctezuma commented 1 year ago

Coincidence?

References:

qiantubu commented 1 year ago

Progress! pil_image vit_large_desc

Excuse me, how was this effect achieved?

MartinBurian commented 1 year ago

This issue has been closed, but I just wanted to report that I managed to get fairly good reproductions using the vitl14_pretrain model in evaluation mode. fig_patch_pca

1. load images, resize to 448x448, transform values to fp32 in the interval (0,1), concatenate into one input tensor of shape [4,3,448,448]
2. image normalization: torchvision.transforms.Normalize(mean=0.5, std=0.2) (the parameters are arbitrary values, around what was mentioned in this thread)
3. run the model: result = model.forward_features(input_tensor)
4. get patch tokens from model output: patch_features = result['x_prenorm'][:,1:,:] (the normalized patch features seem to work similarly)
5. compute the first component of PCA of all the patches of all 4 images, scale the resulting features (projected_features = pca.fit_transform(patch_features); norm_features = sklearn.preprocessing.minmax_scale(projected_features))
6. get foreground patches mask by thresholding the first PCA component
7. compute first 3 components of PCA of foreground patches of all the images
8. scale the PCA output as earlier
9. use directly as RGB values
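
A rough, untested sketch of these steps (the image paths and the threshold value are placeholders):

import torch
import numpy as np
from PIL import Image
from torchvision import transforms
from sklearn.decomposition import PCA
from sklearn.preprocessing import minmax_scale

paths = [f'img_{i}.png' for i in range(4)]  # placeholder image paths

to_tensor = transforms.Compose([
    transforms.Resize((448, 448)),
    transforms.ToTensor(),                    # fp32 values in (0, 1)
    transforms.Normalize(mean=0.5, std=0.2),  # arbitrary normalization, as above
])
input_tensor = torch.stack([to_tensor(Image.open(p).convert('RGB')) for p in paths])  # [4,3,448,448]

model = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitl14').eval()
with torch.no_grad():
    result = model.forward_features(input_tensor)
patch_features = result['x_prenorm'][:, 1:, :].reshape(-1, 1024).numpy()  # 1024 = ViT-L/14 feature dim

# First PCA: one component over all patches of all images, min-max scaled.
norm_features = minmax_scale(PCA(n_components=1).fit_transform(patch_features))[:, 0]

# Foreground mask by thresholding the first component (threshold is a placeholder).
fg = norm_features > 0.5

# Second PCA: 3 components over the foreground patches only, scaled and used directly as RGB.
rgb = np.zeros((patch_features.shape[0], 3))
rgb[fg] = minmax_scale(PCA(n_components=3).fit_transform(patch_features[fg]))
rgb = rgb.reshape(4, 32, 32, 3)  # 448 / 14 = 32 patches per side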

I hope to get to publishing the whole notebook, but these were the crucial steps to the reproduction for me.

Edit: Just to make sure that the model generalizes well, I tested it on another image that was not a part of the training set. The results are just as good as on the training data. That suggests that you can indeed build a very useful model with just DINOv2, a linear classifier and 4 training examples (that are probably well represented in the unsupervised pre-training data).

woctezuma commented 1 year ago

Beautiful results! 🥳

RRoundTable commented 1 year ago

I created a demo on Hugging Face Spaces, made by referring to the discussion and advice above.

https://huggingface.co/spaces/RoundtTble/dinov2-pca

image
mshooter commented 1 year ago

This issue has been closed, but I just wanted to report that I managed to get fairly good reproductions using the vitl14_pretrain model in evaluation mode. fig_patch_pca

  1. load images, resize to 448x448, transform values to fp32 in the interval (0,1), concatenate into one input tensor of shape [4,3,448,448]
  2. image normalization: torchvision.transforms.Normalize(mean=0.5, std=0.2) (the parameters are arbitrary values, around what was mentioned in this thread)
  3. run the model: result = model.forward_features(input_tensor)
  4. get patch tokens from model output: patch_features = result['x_prenorm'][:,1:,:] (the normalized patch features seem to work similarly)
  5. compute the first component of PCA of all the patches of all 4 images, scale the resulting features (projected_features = pca.fit_transform(patch_features); norm_features = sklearn.preprocessing.minmax_scale(projected_features))
  6. get foreground patches mask by thresholding the first PCA component
  7. compute first 3 components of PCA of foreground patches of all the images
  8. scale the PCA output as earlier
  9. use directly as RGB values

I hope to get to publishing the whole notebook, but these were the crucial steps to the reproduction for me.

Edit: Just to make sure that the model generalizes well, I tested it on another image that was not a part of the training set. The results are just as good as on the training data. That suggests that you can indeed build a very useful model with just DINOv2, a linear classifier and 4 training examples (that are probably well represented in the unsupervised pre-training data).

Could you publish the notebook?

MartinBurian commented 1 year ago

Here you go: https://github.com/MartinBurian/dinov2/blob/experiments/experiments/fg_segmantation.ipynb

I cleaned it up, but my environment broke (quite inexplicably :shrug:), so I did not have a chance to re-test it. I just hope it still works :crossed_fingers:

Cliffia123 commented 1 year ago

@MartinBurian Excellent! But I found something wrong in fg_segmantation.ipynb. This code:

all_patches = patch_tokens.reshape([-1,1024])

patch_tokens should be reshaped into [-1, feature_channels], e.g., 768 for ViT-B/14 or 1536 for ViT-g/14.
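
For example, a model-agnostic version of that line (assuming patch_tokens keeps the feature dimension last):

# Use the tensor's own feature dimension instead of hard-coding 1024 (ViT-L/14).
all_patches = patch_tokens.reshape([-1, patch_tokens.shape[-1]])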

shoutOutYangJie commented 1 year ago

I created a demo on Hugging Face Spaces, made by referring to the discussion and advice above.

https://huggingface.co/spaces/RoundtTble/dinov2-pca

image

Wonderful! Can you share your code?

dcbark01 commented 9 months ago

I created a demo on Hugging Face Spaces, made by referring to the discussion and advice above. https://huggingface.co/spaces/RoundtTble/dinov2-pca

image

Wonderful! Can you share your code?

I'm sure you already figured this out, but for posterity's sake: the code is there, just click "Files" in the upper right corner.

luccachiang commented 7 months ago

I found a repo that might be related to this topic: https://github.com/purnasai/Dino_V2.