Open NielsRogge opened 1 year ago
I have a silly question, sorry to ask here. For the hidden_states, I want to convert (batch_size, num_image_patches, embedding_dim) to (batch_size, h, w, embedding_dim) for segmentation tasks. But I found that for a (224, 224) image, num_image_patches is 257 (not 16x16=256). What is the correct way to reshape it?
For torch.hub, there is a provided function:
encoder = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitl14')
hidden_states = encoder.get_intermediate_layers(pixel_values, out_indices, reshape=True)
-> (b, 1024, h/14, w/14)
I browsed the function in the repo and found it is not that easy to adapt directly to outputs.hidden_states for the Hugging Face model.
Hi @Starlento great question! This is because DINOv2 (and vision transformers in general) typically also add a special CLS token before the sequence of image patches. Hence the sequence length becomes (image_size/patch_size)^2 + 1. So in case you use a DINOv2 model with an image resolution of 224 and a patch size of 16, you get (224/16)^2 + 1 = 257 embeddings out.
Hence one usually discards the final embedding of the CLS token, and only uses the embeddings of the image patches, as done here.
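For instance, here's a minimal sketch of doing that reshape by hand (assuming facebook/dinov2-base with patch size 14 and a 224x224 input; in the HF implementation the CLS token is the first token):

import torch
from transformers import AutoModel

model = AutoModel.from_pretrained("facebook/dinov2-base")    # hidden size 768, patch size 14
pixel_values = torch.randn(1, 3, 224, 224)
outputs = model(pixel_values)

hidden_states = outputs.last_hidden_state                   # (1, 257, 768): CLS token + 16x16 patch tokens
patch_embeddings = hidden_states[:, 1:, :]                   # drop the CLS token -> (1, 256, 768)
batch_size, _, dim = patch_embeddings.shape
patches_per_side = 224 // model.config.patch_size            # 16
feature_map = patch_embeddings.reshape(batch_size, patches_per_side, patches_per_side, dim)
print(feature_map.shape)                                     # torch.Size([1, 16, 16, 768])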
I think it makes sense to add a Dinov2Backbone class to the Transformers library, in a similar spirit to other backbones. I've made a PR above for that.
Here's how you can use it (for now you'll need to do pip install git+https://github.com/nielsrogge/transformers.git@add_dinov2_backbone):
from transformers import Dinov2Backbone
import torch
model = Dinov2Backbone.from_pretrained("facebook/dinov2-base", out_indices=[0,1,2,3])
pixel_values = torch.randn(1, 3, 224, 224)
outputs = model(pixel_values)
for feature_map in outputs.feature_maps:
    print(feature_map.shape)
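For facebook/dinov2-base at 224x224 resolution (hidden size 768, patch size 14, so 16x16 patch tokens), each of the four printed shapes should be:

torch.Size([1, 768, 16, 16])
torch.Size([1, 768, 16, 16])
torch.Size([1, 768, 16, 16])
torch.Size([1, 768, 16, 16])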
By default, feature maps will be 4D, i.e. of shape (batch_size, num_channels, height, width). If you want 3D feature maps, just pass reshape=False to the from_pretrained method.
I've created a tutorial notebook on training a linear classifier using DINOv2's frozen features for semantic segmentation. The notebook would be very similar for image classification or depth estimation.
Hi,
Thanks a lot for the tutorial and the HF version. I ran it on Colab. I have two questions:
Error below:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
[<ipython-input-37-433acb205ffe>](https://localhost:8080/#) in <cell line: 1>()
----> 1 model = Dinov2ForSemanticSegmentation.from_pretrained("facebook/dinov2-base", id2label=id2label, num_labels=len(id2label))
1 frames
[<ipython-input-36-e8c7a63af851>](https://localhost:8080/#) in __init__(self, config)
23 super().__init__(config)
24
---> 25 self.dinov2 = Dinov2Model(config, add_pooling_layer=False)
26 self.classifier = LinearClassifier(config.hidden_size, 32, 32, config.num_labels)
27
TypeError: Dinov2Model.__init__() got an unexpected keyword argument 'add_pooling_layer'
ValueError Traceback (most recent call last)
8 frames
/usr/local/lib/python3.10/dist-packages/albumentations/core/composition.py in _check_args(self, **kwargs)
    284
    285         if self.is_check_shapes and shapes and shapes.count(shapes[0]) != len(shapes):
--> 286             raise ValueError(
    287                 "Height and Width of image, mask or masks should be equal. You can disable shapes check "
    288                 "by setting a parameter is_check_shapes=False of Compose class (do it only if you are sure "
ValueError: Height and Width of image, mask or masks should be equal. You can disable shapes check by setting a parameter is_check_shapes=False of Compose class (do it only if you are sure about your data consistency).
I could disable is_check_shapes, but I'm worried that this would leave some images not being converted in the way the model requires.
@rainbowpuffpuff thanks for reporting, I've removed add_pooling_layer recently, so no need to pass that. I've updated my notebook.
Regarding the second question: it looks like Albumentations says there's an image which has a segmentation mask with a different shape. Weird, I haven't encountered that. Will rerun the notebook to verify.
I found that the reshape seems somewhat wrong:
from transformers import Dinov2Backbone
import torch
encoder = Dinov2Backbone.from_pretrained("hf-base-models/facebook_dinov2-large", out_features=["stage6", "stage12", "stage18", "stage24"])
picked_hidden_states = encoder(torch.rand(1, 3, 448, 224)).feature_maps
for x in picked_hidden_states:
    print(x.shape)
torch.Size([1, 1024, 16, 32])
torch.Size([1, 1024, 16, 32])
torch.Size([1, 1024, 16, 32])
torch.Size([1, 1024, 16, 32])
I used to use only square images, so I did not notice the problem before... The problem might be here:
for stage, hidden_state in zip(self.stage_names, hidden_states):
    if stage in self.out_features:
        if self.config.apply_layernorm:
            hidden_state = self.layernorm(hidden_state)
        if self.config.reshape_hidden_states:
            batch_size, _, height, width = pixel_values.shape
            patch_size = self.config.patch_size
            hidden_state = hidden_state[:, 1:, :].reshape(
                batch_size, width // patch_size, height // patch_size, -1
            )
            hidden_state = hidden_state.permute(0, 3, 1, 2).contiguous()
        feature_maps += (hidden_state,)
hidden_state = hidden_state[:, 1:, :].reshape(
batch_size, width // patch_size, height // patch_size, -1
)
Should height be placed in front of width?
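In other words, presumably the reshape should put height first, since the patch tokens are laid out row by row:

hidden_state = hidden_state[:, 1:, :].reshape(
    batch_size, height // patch_size, width // patch_size, -1
)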
Hi @Starlento,
That's indeed a bug in the original implementation; I've addressed it in https://github.com/huggingface/transformers/pull/26092
Has anybody seen an example of using Mask2Former as a head?
Looks like at least one of the images in the dataset is transposed. As a quick hack, I added the following to the SegmentationDataset class, and with it I could train:

if original_image.shape[:2] != original_segmentation_map.shape:
    original_image = np.transpose(original_image, (1, 0, 2))
    print("Transposed and continuing")
    print("Original image " + str(original_image.shape))
Hi folks,
As there are multiple issues here regarding fine-tuning DINOv2 on custom data, questions related to semantic segmentation/depth estimation, image similarity and feature extraction etc. this should now become easier given the model is available in HF Transformers. Check below for tips and tricks.
Documentation: https://huggingface.co/docs/transformers/main/model_doc/dinov2
The checkpoints are on the hub: https://huggingface.co/models?other=dinov2
I've created a tutorial notebook on training a linear classifier using DINOv2's frozen features for semantic segmentation. The notebook would be very similar for image classification or depth estimation.
Semantic segmentation/image classification/depth estimation
Refer to my demo notebook here: https://github.com/NielsRogge/Transformers-Tutorials/tree/master/DINOv2. One just places a linear classifier on top of the model, and uses the features as-is.
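To make that concrete, here is a minimal sketch of such a head (names and sizes are illustrative, not the exact code from the notebook): reshape the patch tokens into a 2D map, apply a 1x1 convolution, and upsample the logits to the target resolution.

import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearSegmentationHead(nn.Module):
    # Hypothetical linear head over frozen DINOv2 patch embeddings (illustrative only).
    def __init__(self, hidden_size, tokens_h, tokens_w, num_labels):
        super().__init__()
        self.tokens_h = tokens_h
        self.tokens_w = tokens_w
        self.classifier = nn.Conv2d(hidden_size, num_labels, kernel_size=1)

    def forward(self, patch_embeddings, output_size):
        # patch_embeddings: (batch, tokens_h * tokens_w, hidden_size), CLS token already removed
        batch_size, _, dim = patch_embeddings.shape
        x = patch_embeddings.reshape(batch_size, self.tokens_h, self.tokens_w, dim).permute(0, 3, 1, 2)
        logits = self.classifier(x)   # (batch, num_labels, tokens_h, tokens_w)
        return F.interpolate(logits, size=output_size, mode="bilinear", align_corners=False)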
Depth estimation:
* DPT + DINOv2 is now supported, a notebook has been made available [here](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/DPT/Inference_with_DPT_%2B_DINOv2_for_depth_estimation.ipynb). A minimal inference sketch follows after this list.
* For fine-tuning, one would however use a different loss function, like [this one](https://github.com/huggingface/transformers/blob/e42587f596181396e1c4b63660abf0c736b10dae/src/transformers/models/glpn/modeling_glpn.py#L765-L766) used in the [GLPN model](https://huggingface.co/docs/transformers/model_doc/glpn#transformers.GLPNForDepthEstimation) to compute the loss between the predicted and ground-truth depth maps.
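A minimal inference sketch (the checkpoint name below is an assumption; see the linked notebook and the hub for the exact checkpoints):

from transformers import AutoImageProcessor, DPTForDepthEstimation
from PIL import Image
import requests
import torch

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

checkpoint = "facebook/dpt-dinov2-base-nyu"   # assumed name, check the hub for the exact checkpoint
processor = AutoImageProcessor.from_pretrained(checkpoint)
model = DPTForDepthEstimation.from_pretrained(checkpoint)

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
predicted_depth = outputs.predicted_depth     # (batch_size, height, width)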
Image classification: here the linear classifier can look simpler, you can just use Dinov2ForImageClassification. Refer to this notebook or example scripts.
Feature extraction
Feature extraction is also very simple:
from transformers import AutoImageProcessor, AutoModel
from PIL import Image
import requests

url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)

processor = AutoImageProcessor.from_pretrained('facebook/dinov2-base')
model = AutoModel.from_pretrained('facebook/dinov2-base')

inputs = processor(images=image, return_tensors="pt")
outputs = model(**inputs)
features = outputs.last_hidden_state
The features in this case will be a PyTorch tensor of shape (batch_size, num_image_patches, embedding_dim). So one can turn them into a single vector by averaging over the image patches, like so:
features = features.mean(dim=1)
Now you have a single 768-dim (or other sizes, depending on which one you are using) vector for each image in your batch.
Getting intermediate features
Intermediate features can be obtained in 2 ways: either by passing output_hidden_states=True to the forward method of the code snippet above (the outputs will then contain an additional key called hidden_states, which contains the intermediate features for each of the Transformer layers), or by using the Dinov2Backbone class.
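For example (a minimal sketch, reusing the model and inputs from the feature extraction snippet above):

outputs = model(**inputs, output_hidden_states=True)
print(len(outputs.hidden_states))        # embedding output + one entry per Transformer layer (13 for the base model)
print(outputs.hidden_states[-1].shape)   # (batch_size, num_tokens, hidden_size), before the final layernorm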
Image similarity
We have a tutorial on that here: https://huggingface.co/blog/image-similarity. Given that DINOv2 now is available in HF Transformers, one can simply replace the model_ckpt in the blog with the ones of DINOv2 on the 🤗 hub.
Can be relevant for #6, #14, #15, #25, #46, #47, #54, #55, #80, #84, #97, #99
Have fun fine-tuning them!
Cheers,
Niels
Hi @NielsRogge, thanks for the detailed information. Then, according to your explanation, doing:
outputs = model(**inputs)
outputs.last_hidden_state
should be exactly the same as doing:
outputs = model(**inputs, output_hidden_states=True)
outputs.hidden_states[-1]
In my case I find that these 2 chunks of code return completely different tensors, so I am not sure which one corresponds to the final CLS and patch embeddings at the end of the Transformer. If you could clarify it would be great.
Thanks!
Yes, that's because there's a layernorm applied to the last hidden states, as seen here.
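As a quick check (a minimal sketch, reusing the setup from the feature extraction snippet above and assuming the final layernorm is exposed as model.layernorm, as in similar ViT models in Transformers):

import torch

outputs = model(**inputs, output_hidden_states=True)
# hidden_states[-1] is the output of the last Transformer layer, before the final layernorm
normed = model.layernorm(outputs.hidden_states[-1])
print(torch.allclose(normed, outputs.last_hidden_state, atol=1e-5))   # expected: True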
Thanks for the answer. So when using DINOv2 as a feature extractor for images, is it better to take the embeddings after applying LayerNorm or before?
Thanks!
It's mostly a matter of experimentation, I would just try out both and see which ones work best.
@NielsRogge, just for the sake of clarity, to account for the second dimension of the output tensor being (image_size / patch_size)^2 + 1: from what I read and understood from the model card, the model's patch size is 14 pixels and not 16 pixels as you mentioned here, so that with an image resolution of 224 and a patch size of 14, you get (224 / 14)^2 + 1 = 257 embeddings out. Thank you so much for your work! ;)
Could you reproduce any of the paper's results?
@franchesoni I ported the weights to the HF format; to reproduce the results I'd recommend the scripts present in the original repository. We do have image classification scripts here: https://github.com/huggingface/transformers/tree/main/examples/pytorch/image-classification, but they are mostly for demo purposes, and need to be tweaked for a specific use case.
@NielsRogge thank you for your amazing contribution! Please can you tell me more about the "rescale_factor" parameter: why do we need it, and why does it have to take on a value of 0.00392156862745098? I did not find the corresponding piece of code in the official repo of DINOv2, can you point it out?
Hi @zhaoyanpeng that value comes from 1/255. Typically the red, green and blue color channels of images have values between 0 and 255. Neural networks on the other hand are typically trained on numbers between 0 and 1. So rescaling is a kind of standardization step. The original repository uses ToTensor from torchvision which does the same thing.
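You can also inspect this directly on the image processor (a quick sketch):

from transformers import AutoImageProcessor

processor = AutoImageProcessor.from_pretrained("facebook/dinov2-base")
print(processor.rescale_factor)                    # 0.00392156862745098 == 1 / 255
print(processor.do_rescale)                        # True: pixel values are multiplied by rescale_factor
print(processor.image_mean, processor.image_std)   # ImageNet mean/std used for normalization afterwards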
Ah, it now makes sense. Thank you for your prompt reply!
Hi, does anyone have an issue like this: a KeyError for dinov2? It's with transformers==4.30.2 and timm==0.9.12. Everything was downloaded from https://hf-mirror.com/facebook/dinov2-base/tree/main.
DINOv2 was probably added in a later version of Transformers, so pip install --upgrade transformers will fix that.
@NielsRogge thank you for the amazing work! I have a question regarding image similarity calculation: should I take the mean value of the last_hidden_state for each of the 2 images to compute
emb_img1, emb_img2 = last_hidden_states[0].mean(dim=0), last_hidden_states[1].mean(dim=0)
metric = F.cosine_similarity(emb_img1, emb_img2, dim=0)
or
emb_img1, emb_img2 = last_hidden_states[0, 0], last_hidden_states[1, 0] # Get cls token (0-th token) for each img
The second snippet successfully replicated the result of this paper: https://openaccess.thecvf.com/content/CVPR2023/supplemental/Ruiz_DreamBooth_Fine_Tuning_CVPR_2023_supplemental.pdf, but the first one did not.
Hi,
It depends a bit: some models have a CLS token which is specifically trained in a contrastive way, like CLIP or SigLIP, so for those it's advised to use that. Other models work better by average pooling the final hidden state of the patch tokens.
So I would try both approaches and see which one works best.
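For example, a minimal sketch of trying both (assuming a batch of two preprocessed images and the feature extraction setup from earlier):

import torch.nn.functional as F

last_hidden_states = outputs.last_hidden_state   # (2, num_tokens, dim) for a batch of two images
# option 1: CLS token (token 0) of each image
cls_sim = F.cosine_similarity(last_hidden_states[0, 0], last_hidden_states[1, 0], dim=0)
# option 2: mean of the patch tokens (excluding the CLS token)
mean_sim = F.cosine_similarity(last_hidden_states[0, 1:].mean(dim=0), last_hidden_states[1, 1:].mean(dim=0), dim=0)
print(cls_sim.item(), mean_sim.item())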
Hi @NielsRogge What would be the necessary steps to use DINOv2 with ViT Adapter? What would be the easiest way to achieve this?