facebookresearch / dino

PyTorch code for Vision Transformers training with the Self-Supervised learning method DINO
Apache License 2.0

Image resolution at intermediate layers? #252

Open yousafe007 opened 1 year ago

yousafe007 commented 1 year ago

As is already clear, images are resized to 224x224 before being fed into DINO. I am currently working with the features from intermediate layers, specifically layer 9. What is the image resolution at that layer (or at any other layer, for that matter)?

@mathildecaron31 Any help would be appreciated. :)

tcourat commented 7 months ago

This is a vision transformer, hence the spatial resolution stays the same throughout the whole network; there are no pooling layers like in a CNN. However, each token corresponds to an 8x8 patch, hence the feature map resolution is 28x28.

yousafe007 commented 7 months ago

> This is a vision transformer, hence the spatial resolution stays the same throughout the whole network; there are no pooling layers like in a CNN. However, each token corresponds to an 8x8 patch, hence the feature map resolution is 28x28.

Perhaps my question was ill-formulated. I meant the feature map, as you said. Could you tell me how you arrived at the number 28?

tcourat commented 7 months ago

The input image has size 224x224, hence you divide each dimension by the patch size 8 to obtain feature maps of size 28x28. If you choose another patch size (different from 8x8), this may change.
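
As a quick sanity check, the arithmetic in plain Python (using the image and patch sizes above):

img_size, patch_size = 224, 8
grid = img_size // patch_size    # 224 / 8 = 28 patches per side
num_patch_tokens = grid * grid   # 28 * 28 = 784 patch tokens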

If you look at the embeddings given by the model for one image, you get a tensor of shape (785, 768). This is because 785 = 1 + 28*28 (a CLS token is prepended to the 28x28 = 784 patch tokens of the feature map). 768 is the hidden dimension (at least for the vitb8 model).
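
As a concrete illustration, here is a minimal sketch of how one could obtain such a tensor, using the torch hub entry point from the DINO README and the model's get_intermediate_layers method (which returns the token embeddings of the last n blocks); the n=4 offset for reaching "layer 9" of a 12-block ViT-B is my assumption, not something stated in this thread:

import torch

# Load the DINO ViT-B/8 backbone (hub entry point from the DINO README)
model = torch.hub.load('facebookresearch/dino:main', 'dino_vitb8')
model.eval()

x = torch.randn(1, 3, 224, 224)  # one 224x224 RGB image
with torch.no_grad():
    # get_intermediate_layers returns the token embeddings of the n last blocks;
    # with n=4 on a 12-block ViT-B, the first entry is the output of the 9th block
    tokens = model.get_intermediate_layers(x, n=4)[0]
print(tokens.shape)  # torch.Size([1, 785, 768]) = (batch, 1 + 28*28 tokens, hidden dim)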

If you want to obtain the "image-like" feature maps, you can get rid of the CLS token and reshape the tensor, e.g.:

fmap = fmap[1:, :]                # drop the CLS token, keep the 784 patch tokens
fmap = fmap.reshape(28, 28, 768)  # (H, W, C) "image-like" feature map

The snippet above may change slightly if you deal with batched images (add a batch dimension in that case), or with another patch size or hidden dimension, depending on the model.
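
For example, a sketch of the batched case, continuing from the tokens tensor above (the shapes assume vitb8 at 224x224 resolution):

B = tokens.shape[0]
fmap = tokens[:, 1:, :]              # drop the CLS token of each image
fmap = fmap.reshape(B, 28, 28, 768)  # (B, H, W, C)
fmap = fmap.permute(0, 3, 1, 2)      # (B, C, H, W), the usual PyTorch image layout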

(Please note that I am not one of the creators of this repo; I am only sharing what I understood of the architecture, as I am currently also digging into DINOv2.)