shoutOutYangJie closed this issue 1 year ago
Loosely related:
The quality is very different.
That is because the user did not keep the first 3 components, etc. See the caption of the first figure in the article.
I don't know what "threshold the first component and execute a second PCA" means. Please help me @woctezuma
I tried my best, but I can't remove the background.
I think you need to find a threshold for the first component. From what I understand, this removes the background. I have not checked how the threshold is computed for the figure in the paper, I guess automatically.
Then you can use the first three components to visualize the areas in RGB.
I have not tested, but that is my understanding of the figure.
If I want the first 3 components, why do I need to find a threshold for the first component? I am very confused. Is it one or three? If I remove the other components whose eigenvalues are less than the first component's, there is only one component left, not three.
The background is one of the major components (often part of the first one); that is why the threshold needs to be applied there.
With the thresholded first component, you would be able to tell the foreground from the background (in black). Then you would get the colors with the first three components. That is what I understand; not tested by myself.
Let us discuss this in detail!
The token features extracted from the network form X (shape 256x1536, where 256 comes from the 16x16 patch grid).
First, we apply PCA:
eigen_values, eigen_vectors = eig(X.T.dot(X))
The eigenvector matrix has shape 1536x1536. Second, I sort the eigenvalues to remove the first component, filter out the values that are less than 0, and execute a second PCA, directly reducing the dimension to 3 (the RGB channels).
Is there something wrong?
@woctezuma @ccharest93
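For reference, here is a minimal, untested sketch of the PCA-by-eigendecomposition procedure described above (the placeholder features and variable names are mine). One thing that is easy to miss: the features are usually mean-centered before the eigendecomposition.
import numpy as np
# X: patch tokens for one image, shape (256, 1536) -- 16x16 patches, ViT-g/14 embedding dim
X = np.random.randn(256, 1536).astype(np.float32)  # placeholder features
Xc = X - X.mean(axis=0, keepdims=True)   # mean-center before PCA
# Eigendecomposition of the 1536x1536 scatter matrix
eig_vals, eig_vecs = np.linalg.eigh(Xc.T @ Xc)
# eigh returns eigenvalues in ascending order, so reorder to descending
order = np.argsort(eig_vals)[::-1]
eig_vecs = eig_vecs[:, order]
# Project every patch onto the leading components
first_component = Xc @ eig_vecs[:, 0]    # (256,)   -- used to threshold foreground vs background
first_three = Xc @ eig_vecs[:, :3]       # (256, 3) -- rescaled and shown as RGB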
I haven't gotten to this point yet, but let me see if I can help. Assuming you have 16 x 16 patches and an embedding dimension of 1536 for each patch, your output shape will be [256, 1536].
Then do PCA,
with the resulting 1536x1536 matrix forming a basis for the embedding dimension. Then sort the eigenvectors by eigenvalue and get a decomposition of your original 256x1536 embedding matrix in terms of this basis.
This is where you have to figure out which components you want to keep and threshold. You could try: remove the basis vector with the highest eigenvalue and take the next 3 basis vectors with the highest eigenvalues as your RGB channels. You then have an RGB color for each of your patches, and you interpolate?
I'll be able to be of more help once I get to that myself.
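To make that suggestion concrete, a small untested sketch (the placeholder projections and the threshold of 0 are arbitrary): threshold one component to mask the background, rescale three components per channel, and show one color per patch.
import numpy as np
# Placeholder per-patch PCA projections for a single image (16x16 = 256 patches)
first_component = np.random.randn(256)        # projection used for foreground/background
first_three = np.random.randn(256, 3)         # projections used for coloring
fg_mask = first_component > 0.0               # arbitrary threshold; pick it from a histogram
rgb = first_three.copy()
rgb -= rgb.min(axis=0, keepdims=True)
rgb /= rgb.max(axis=0, keepdims=True) + 1e-8  # rescale each channel to [0, 1]
rgb[~fg_mask] = 0.0                           # background patches in black
rgb_image = rgb.reshape(16, 16, 3)            # 16x16 patch grid, one color per patch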
I followed the description from the paper and tested the example elephant images from the paper. The following code can segment out the foreground. It is kinda able to learn features similar to Fig. 1 from the paper, but not exactly. Feel free to play around with it! I am looking forward to more discussion!
import torch
import torch.nn.functional as F
import torchvision.transforms as T
import os
import cv2
import matplotlib.pyplot as plt
import numpy as np
from PIL import Image
patch_h = 40
patch_w = 40
# feat_dim = 384 # vits14
# feat_dim = 768 # vitb14
# feat_dim = 1024 # vitl14
feat_dim = 1536 # vitg14
transform = T.Compose([
T.GaussianBlur(9, sigma=(0.1, 2.0)),
T.Resize((patch_h * 14, patch_w * 14)),
T.CenterCrop((patch_h * 14, patch_w * 14)),
T.ToTensor(),
T.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])
# dinov2_vits14 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14')
# dinov2_vitb14 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitb14')
# dinov2_vitl14 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitl14')
dinov2_vitg14 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitg14')
print(dinov2_vitg14)
# extract features
features = torch.zeros(4, patch_h * patch_w, feat_dim)
imgs_tensor = torch.zeros(4, 3, patch_h * 14, patch_w * 14)
for i in range(4):
    img_path = f'dino_test_imgs/elephant_{i+1}.png'
    img = Image.open(img_path).convert('RGB')
    imgs_tensor[i] = transform(img)[:3]
with torch.no_grad():
    features_dict = dinov2_vitg14.forward_features(imgs_tensor)
    features = features_dict['x_norm_patchtokens']
# PCA for feature inferred
from sklearn.decomposition import PCA
features = features.reshape(4 * patch_h * patch_w, feat_dim)
pca = PCA(n_components=3)
pca.fit(features)
pca_features = pca.transform(features)
# visualize PCA components for finding a proper threshold
plt.subplot(1, 3, 1)
plt.hist(pca_features[:, 0])
plt.subplot(1, 3, 2)
plt.hist(pca_features[:, 1])
plt.subplot(1, 3, 3)
plt.hist(pca_features[:, 2])
plt.show()
plt.close()
# uncomment below to plot the first pca component
# pca_features[:, 0] = (pca_features[:, 0] - pca_features[:, 0].min()) / (pca_features[:, 0].max() - pca_features[:, 0].min())
# for i in range(4):
# plt.subplot(2, 2, i+1)
# plt.imshow(pca_features[i * patch_h * patch_w: (i+1) * patch_h * patch_w, 0].reshape(patch_h, patch_w))
# plt.show()
# plt.close()
# segment using the first component
pca_features_bg = pca_features[:, 0] < 10
pca_features_fg = ~pca_features_bg
# plot the pca_features_bg
for i in range(4):
    plt.subplot(2, 2, i+1)
    plt.imshow(pca_features_bg[i * patch_h * patch_w: (i+1) * patch_h * patch_w].reshape(patch_h, patch_w))
plt.show()
# PCA for only foreground patches
pca.fit(features[pca_features_fg]) # NOTE: I forgot to add it in my original answer
pca_features_rem = pca.transform(features[pca_features_fg])
for i in range(3):
    # pca_features_rem[:, i] = (pca_features_rem[:, i] - pca_features_rem[:, i].min()) / (pca_features_rem[:, i].max() - pca_features_rem[:, i].min())
    # transform using mean and std, I personally found this transformation gives a better visualization
    pca_features_rem[:, i] = (pca_features_rem[:, i] - pca_features_rem[:, i].mean()) / (pca_features_rem[:, i].std() ** 2) + 0.5
pca_features_rgb = pca_features.copy()
pca_features_rgb[pca_features_bg] = 0
pca_features_rgb[pca_features_fg] = pca_features_rem
pca_features_rgb = pca_features_rgb.reshape(4, patch_h, patch_w, 3)
for i in range(4):
    plt.subplot(2, 2, i+1)
    plt.imshow(pca_features_rgb[i][..., ::-1])
plt.savefig('features.png')
plt.show()
plt.close()
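Not part of the original snippet, but if you want the 40x40 patch map back at the input resolution (the "interpolate" step mentioned earlier), here is a small optional sketch that reuses F, patch_h, patch_w and pca_features_rgb from the code above:
# Optional: upsample the patch-level RGB map to the network input resolution for display.
rgb_tensor = torch.from_numpy(pca_features_rgb).permute(0, 3, 1, 2).float()  # (4, 3, 40, 40)
rgb_upsampled = F.interpolate(rgb_tensor, size=(patch_h * 14, patch_w * 14),
                              mode='bilinear', align_corners=False)
plt.imshow(rgb_upsampled[0].permute(1, 2, 0).clip(0, 1).numpy())
plt.show()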
So what is the "second PCA" referred to by the paper?
Wow, how did you get this result?
I have used your code, but I find the result still shows a little background.
Oh, it seems the first PCA is used to mask the background; then for the second PCA you need a batch of 3 related images (so 3 cats?), and you do the PCA across the 3 images to get a good signal for the components. I'm assuming the 3 images have to be fairly different while having similar object parts in them, so that PCA can pick up on them.
A good paper on co-segmentation using DINOv1, with even better results: Deep ViT Features as Dense Visual Descriptors.
Wow, I overlooked this part! 👏
We compute a PCA between the patches of the images from the same column (a, b, c and d) and show their first 3 components. Each component is matched to a different color channel.
As mentioned on page 16 of the paper: "We compute a second PCA on the remaining patches across three images depicting the same category." I think that means we need to do a second PCA on the foreground patches.
I still can't seem to get a good signal for thresholding the background with only the first component. With batches of images it is easier, but even though the first component seems to give a good indication of depth, it doesn't seem to have a threshold that consistently separates the foreground from the background like shown in the paper.
C1: the original image; C2: first PCA rescaled to grayscale; C3: first PCA applied to groups of heads, rescaled to RGB (R -> heads 1-8, G -> heads 9-16, B -> heads 17-24).
I've tried other methods too but can't seem to get a consistent result. Has anyone made any progress here? This is only foreground/background separation; I haven't gotten to the second PCA yet for image part correspondence.
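For reference, a rough untested sketch of the head-group coloring described above, assuming ViT-g/14 patch tokens (1536 dims = 24 heads x 64 dims per head) split into 3 blocks of 8 heads, with placeholder features and the sklearn PCA used earlier in this thread:
import numpy as np
from sklearn.decomposition import PCA
# tokens: (n_patches, 1536) patch features from ViT-g/14 (24 heads x 64 dims per head)
tokens = np.random.randn(40 * 40, 1536).astype(np.float32)  # placeholder features
head_dim, heads_per_group = 64, 8                            # 3 groups of 8 heads -> R, G, B
channels = []
for g in range(3):
    sl = slice(g * heads_per_group * head_dim, (g + 1) * heads_per_group * head_dim)
    comp = PCA(n_components=1).fit_transform(tokens[:, sl])[:, 0]
    comp = (comp - comp.min()) / (comp.max() - comp.min() + 1e-8)  # rescale to [0, 1]
    channels.append(comp)
rgb = np.stack(channels, axis=-1).reshape(40, 40, 3)  # one color per patch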
@ccharest93 You should use a clean image with a clear background/foreground separation
See my comment on this related issue: #45
Progress!
Could you please share more details on how you obtained this result?
Using the large model gave me better results than the giant model. I followed the procedure from the paper (2-step PCA), and I grab the outputs before the last normalization.
transforms.Compose([transforms.Resize(518, interpolation=transforms.InterpolationMode.LANCZOS),
transforms.CenterCrop(518),
transforms.ToTensor(),
transforms.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225))])
is my input transformation. It's possible I changed other things in the reimplementation on my GitHub, but I don't think so. As for why ViT-L works better than ViT-g on noisy backgrounds, I think it might have to do with how sparse the embeddings of the main object parts end up being compared to the background; I'm still doing analysis on that.
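If it helps, this is roughly how the two patch-token variants discussed in this thread can be pulled out of forward_features (key names as used elsewhere in this thread; the placeholder batch is mine, and I have only checked this against the hub models mentioned here):
import torch
dinov2_vitl14 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitl14').eval()
imgs_tensor = torch.randn(4, 3, 448, 448)  # placeholder batch; use a real transform in practice
with torch.no_grad():
    out = dinov2_vitl14.forward_features(imgs_tensor)
patch_tokens_norm = out['x_norm_patchtokens']   # patch tokens after the final LayerNorm
patch_tokens_pre = out['x_prenorm'][:, 1:, :]   # tokens before the last normalization, class token dropped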
Excuse me, how was this effect achieved?
This issue has been closed, but I just wanted to report that I managed to get fairly good reproductions using the vitl14_pretrain model in evaluation mode.
1) Load the images, resize to 448x448, transform the values to fp32 in the interval (0, 1), and concatenate into one input tensor of shape [4, 3, 448, 448].
2) Image normalization: torchvision.transforms.Normalize(mean=0.5, std=0.2)
(the parameters are arbitrary values, around what was mentioned in this thread)
3) Run the model: result = model.forward_features(input_tensor)
4) Get the patch tokens from the model output: patch_features = result['x_prenorm'][:,1:,:]
(the normalized patch features seem to work similarly)
5) Compute the first PCA component of all the patches of all 4 images and scale the resulting features: projected_features = pca.fit_transform(patch_features); norm_features = sklearn.preprocessing.minmax_scale(projected_features)
6) Get the foreground patch mask by thresholding the first PCA component.
7) Compute the first 3 PCA components of the foreground patches of all the images.
8) Scale the PCA output as before.
9) Use the result directly as RGB values.
I hope to get to publishing the whole notebook, but these were the crucial steps to the reproduction for me.
Edit: Just to make sure that the model generalizes well, I tested it on another image that was not a part of the training set. The results are just as good as on the training data. That suggests that you can indeed build a very useful model with just DINOv2, a linear classifier and 4 training examples (that are probably well represented in the unsupervised pre-training data).
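For what it's worth, here is a compact, untested sketch of the steps above as I read them (image_paths is a placeholder; the 0.5 threshold and the normalization constants come from this thread and may need tuning):
import numpy as np
import torch
import torchvision.transforms as T
from PIL import Image
from sklearn.decomposition import PCA
from sklearn.preprocessing import minmax_scale
model = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitl14').eval()
transform = T.Compose([T.Resize((448, 448)), T.ToTensor(), T.Normalize(mean=0.5, std=0.2)])
image_paths = ['img1.png', 'img2.png', 'img3.png', 'img4.png']  # placeholder paths
imgs = torch.stack([transform(Image.open(p).convert('RGB')) for p in image_paths])  # (4, 3, 448, 448)
with torch.no_grad():
    out = model.forward_features(imgs)
patches = out['x_prenorm'][:, 1:, :].reshape(-1, 1024).numpy()  # ViT-L/14: 32x32 patches, 1024 dims
# First PCA: one component over all patches, thresholded to get the foreground mask
first = minmax_scale(PCA(n_components=1).fit_transform(patches))[:, 0]
fg = first > 0.5  # guessed threshold; inspect the histogram of `first` to pick it
# Second PCA: three components fit on foreground patches only, scaled and used as RGB
rgb = np.zeros((patches.shape[0], 3))
rgb[fg] = minmax_scale(PCA(n_components=3).fit_transform(patches[fg]))
rgb = rgb.reshape(4, 32, 32, 3)  # one RGB value per 14x14 patch; upsample for display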
Beautiful results! 🥳
I created a demo on a Hugging Face Space, made by referring to the code and advice above: https://huggingface.co/spaces/RoundtTble/dinov2-pca
@MartinBurian Could you publish the notebook?
Here you go: https://github.com/MartinBurian/dinov2/blob/experiments/experiments/fg_segmantation.ipynb
I cleaned it up, but my environment broke (quite inexplicably :shrug:), so I did not have a chance to re-test it. I just hope it still works :crossed_fingers:
@MartinBurian Excellent! But I found something wrong in fg_segmantation.ipynb. This code:
all_patches = patch_tokens.reshape([-1,1024])
should reshape patch_tokens into [-1, feature_channels], e.g. 768 for ViT-B/14 or 1536 for ViT-g/14.
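A small dimension-agnostic fix along those lines (untested against the notebook, just illustrating the idea):
# Reshape using the actual embedding dimension instead of hard-coding 1024,
# so the same cell works for ViT-S/B/L/g (384 / 768 / 1024 / 1536 channels).
all_patches = patch_tokens.reshape(-1, patch_tokens.shape[-1])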
Wonderful, can you share your code?
I'm sure you already figured this out, but for posterity's sake: the code is there, just click "Files" in the upper right corner.
I found a repo that might be related to this topic: https://github.com/purnasai/Dino_V2.
If you can share the code for the visualization, it would be a big and nice contribution.