facebookresearch / dino

PyTorch code for Vision Transformers training with the Self-Supervised learning method DINO
Apache License 2.0

`interpolate_pos_encoding(x, pos_embed)` doesn't return the correct dimensions for images that are not square (w != h) #8

Closed enverfakhan closed 3 years ago

enverfakhan commented 3 years ago

I noticed that the generation of positional embeddings in the interpolate_pos_encoding method is slightly different from the one in the forward_selfattention method. The following simple modification brings the two in line with each other, in case it is of interest.

    def interpolate_pos_encoding(self, x, pos_embed, w, h):  # passing w and h as arguments
        npatch = x.shape[1] - 1
        N = pos_embed.shape[1] - 1
        if npatch == N:
            return pos_embed
        class_emb = pos_embed[:, 0]
        pos_embed = pos_embed[:, 1:]
        dim = x.shape[-1]
        w0 = w // self.patch_embed.patch_size  # just copy paste from forward_selfattention
        h0 = h // self.patch_embed.patch_size
        pos_embed = nn.functional.interpolate(
            pos_embed.reshape(1, int(math.sqrt(N)), int(math.sqrt(N)), dim).permute(0, 3, 1, 2),
            scale_factor=(w0 / math.sqrt(N), h0 / math.sqrt(N)),  # replace math.sqrt(npatch / N) with one from forward_selfattention
            mode='bicubic',
        )
        pos_embed = pos_embed.permute(0, 2, 3, 1).view(1, -1, dim)
        return torch.cat((class_emb.unsqueeze(0), pos_embed), dim=1)
enverfakhan commented 3 years ago

I realized this solution is prone to floating-point error; an example of such an error would be the following:

npatch = 3904
N = 784
w, h = 491, 532
self.patch_embed.patch_size = 8
w0 = w // self.patch_embed.patch_size  # 61
h0 = h // self.patch_embed.patch_size   # 64
pos_embed = nn.functional.interpolate(
            pos_embed.reshape(1, int(math.sqrt(N)), int(math.sqrt(N)), dim).permute(0, 3, 1, 2),
            scale_factor=(w0 / math.sqrt(N), h0 / math.sqrt(N)),  
            mode='bicubic',
        )
pos_embed = pos_embed.permute(0, 2, 3, 1).view(1, -1, dim)  # pos_embed.shape[1] is 3840 (60 * 64) which is supposed to be 3904 (61 * 64)

This happens because w0 / math.sqrt(N) * math.sqrt(N) is not equal to w0: nn.functional.interpolate (I assume) truncates the computed output size to an int, and w0 / math.sqrt(N) * math.sqrt(N) comes out as something like 60.999999999..., which becomes 60 after the int cast. A solution for this problem would be adding a small number (e.g. 0.1) to w0 and h0.
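
To make the truncation concrete, here is a minimal sketch in plain Python mirroring the arithmetic that interpolate does internally (the output size is effectively floor(input_size * scale_factor)); the exact floating-point outcome can differ between setups, but these are the values reported above:

    import math

    # Numbers from the example above: a 28x28 grid (N=784) resized towards a 61-wide grid
    # (patch_size=8, image width 491).
    N, w0 = 784, 61
    in_size = int(math.sqrt(N))                                      # 28

    out_w = math.floor(in_size * (w0 / math.sqrt(N)))                # 60 in the case above, not 61
    out_w_fixed = math.floor(in_size * ((w0 + 0.1) / math.sqrt(N)))  # 61 with the small offset
    print(out_w, out_w_fixed)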

However, this brings up another question: does interpolating the positional embeddings make sense? I mean, does this operation happen during training, or are all the images in training already the right shape so that interpolation never occurs? Because in that case interpolating the (learned) positional embeddings would be illegitimate, right?

mathildecaron31 commented 3 years ago

Hi @enverfakhan

Thanks for raising this issue. Yes, I totally agree that something could be done to simplify/unify the code there a bit...

For the floating-point error I've found this workaround: https://github.com/facebookresearch/dino/blob/1d06521adfc53c80dece1a74902f718873e9821d/vision_transformer.py#L235-L240

I mean, does this operation happen during training, or are all the images in training already the right shape so that interpolation never occurs?

This operation actually happens during training. Indeed, with multi-crop the model is trained both on 224x224 images and on 96x96 images, so when forwarding a batch of 96x96 images we need to interpolate the encodings. As a matter of fact, in my experiments I also tried having two sets of encodings; in practice that means I was using different encodings for the 224x224 inputs and for the 96x96 inputs. This solution has exactly the same performance as performing bicubic interpolation, which makes me think that the interpolation solution makes sense.
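
To make the grid sizes concrete: with a patch size of 8, a 224x224 image gives a 28x28 grid of patch tokens and a 96x96 crop gives 12x12, so the learned 28x28 positional grid is bicubically resized to 12x12 for the small crops. A minimal sketch with a dummy tensor, assuming those shapes (dim=384 is just an example):

    import torch
    import torch.nn as nn

    dim = 384                                  # example embedding dimension
    pos_grid = torch.randn(1, dim, 28, 28)     # learned positional grid for 224x224, patch size 8
    small = nn.functional.interpolate(
        pos_grid,
        scale_factor=(12 + 0.1) / 28,          # 96x96 crop -> 12x12 grid, with the small offset discussed above
        mode="bicubic",
    )
    print(small.shape)                         # torch.Size([1, 384, 12, 12])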

enverfakhan commented 3 years ago

Hi @mathildecaron31, thanks for the response, and I really appreciate the insight about interpolation vs. separate pos_embed. I would be curious how that would behave in the wild with completely different sizes. I actually tried deit_small (patch_size=8) for a retrieval task on in-house data; it seems to work on par with a supervised VGG trained on ImageNet. However, I had to set the image sizes to [224, 224] because some of the images blew up the memory during the attention computation.

About the workaround for the floating-point error, I feel like incrementing w0 and h0 by a small amount is more legitimate than zero-padding the pos_embed, but it is probably not a big deal, especially if the image size is relatively big.

Looking forward to DINO on a large, random, uncurated dataset.

mathildecaron31 commented 3 years ago

I actually tried deit_small (patch_size=8) for a retrieval task on in-house data; it seems to work on par with a supervised VGG trained on ImageNet.

That's slightly disappointing :/. Have you tried the other models? For example, ViT-Base/16 should be more manageable memory-wise. As a matter of fact, on copy detection datasets I've found the base models to perform clearly better than the small ones: I get better performance with Base 16x16 than with Small 8x8, though Small 8x8 is better at k-NN ImNet.

About the workaround for the floating-point error, I feel like incrementing w0 and h0 by a small amount is more legitimate than zero-padding the pos_embed, but it is probably not a big deal, especially if the image size is relatively big.

Yes, your solution is definitely better! I'll update that in the code.

enverfakhan commented 3 years ago

I picked Small 8x8 because it was shown to perform better at k-NN ImNet, and since I was going to try zero-shot retrieval, that choice made more sense at the time. The result was indeed slightly disappointing; however, I haven't experimented exhaustively and I don't have quantitative results either. I only checked qualitatively, which you can only do for a handful of queries, so this result is not definitive at all. But I should say the in-house data is very different from ImageNet, so I wouldn't be very surprised if I got some weird results with either model.

KeremTurgutlu commented 3 years ago

But I should say the in-house data is very different from ImageNet, so I wouldn't be very surprised if I got some weird results with either model.

I wonder if finetuning DINO models on the in-house data you have might help? But you mentioned that results are on par with a VGG pretrained on ImageNet, so I am not very sure. Probably still worth trying.

enverfakhan commented 3 years ago

I actually tried deit_small (patch_size=8) for a retrieval task on in-house data; it seems to work on par with a supervised VGG trained on ImageNet.

I guess I unintentionally spread some misinformation. The images were RGBA and I was treating them as RGB; I'm sorry if I caused any confusion. However, after accounting for the image format, the results still vary from query (image) to query. In some cases DINO outperforms a VGG-16 ImNet by far, but in other cases they are almost on par, or DINO is even worse. I haven't detected a consistent pattern for which images DINO outperforms or underperforms on, but so far, anecdotally, it seems like DINO outperforms on colorful images and is on par with (or worse than) VGG-16 ImNet on black-and-white images. By the way, I'm testing OpenAI's CLIP model with ViT too, and the CLIP model seems to be the worst among the three (I was betting on the CLIP model to be the best, but couldn't have been more wrong :) )

It's hard to come up with a quantitative evaluation. The images are multi-tagged and we are trying to retrieve similar images given a query. The tags are not reliable for evaluation because some similar images don't share any tag, and the reverse is also the case: different images may share a common tag (Instagram logo vs. Instagram app image). I'm planning to do fine-tuning as multi-class classification and try to get some numeric assessment out of that.

I wonder if finetuning DINO models on the in-house data you have might help?

I strongly believe it would help, but I wonder which model would perform better after finetuning. However, the preliminary result that DINO works better on colorful images is worth paying attention to.

woctezuma commented 3 years ago

By the way, I'm testing OpenAI's CLIP model with ViT too, and the CLIP model seems to be the worst among the three (I was betting on the CLIP model to be the best, but couldn't have been more wrong :) )

Not surprised. :D cf. https://github.com/openai/CLIP/issues/1 and https://openai.com/blog/multimodal-neurons/

enverfakhan commented 3 years ago

Thank you for the reference to the issue; it was super fun to read and check the examples (and of course eye-opening :) ). So the CLIP model is out of the running.

KeremTurgutlu commented 3 years ago

@mathildecaron31 I have a question about copy detection. I am trying to evaluate the pretrained DINO models on a dataset for the copy detection task and I am trying to follow the steps from the paper. Even with different image input sizes, Table 4 shows that the final embedding dimension is 1536. I am not able to understand how we can get the same embedding dimension after concatenating the CLS embedding and the GeM-pooled output patch tokens for different input image sizes. Maybe I am missing a point here. Here is what I did:

Added the following method to VisionTransformer to return the output patch tokens and the CLS output.

    def forward_output_patch_tokens_cls(self, x):
        B = x.shape[0]
        x = self.patch_embed(x)

        cls_tokens = self.cls_token.expand(B, -1, -1)
        x = torch.cat((cls_tokens, x), dim=1)
        pos_embed = self.interpolate_pos_encoding(x, self.pos_embed)
        x = x + pos_embed
        x = self.pos_drop(x)

        for blk in self.blocks:
            x = blk(x)
        if self.norm is not None:
            x = self.norm(x)

        return x

Using the GeM module from here:

import torch
import torch.nn as nn
import torch.nn.functional as F

def gem(x, p=3, eps=1e-6):
    "x: BS x embed_dim x num_tokens (pooled over the token dimension)"
    return F.avg_pool1d(x.clamp(min=eps).pow(p), (x.size(-1))).pow(1./p)

class GeM(nn.Module):

    def __init__(self, p=3, eps=1e-6):
        super(GeM,self).__init__()
        self.p = nn.Parameter(torch.ones(1)*p)
        self.eps = eps

    def forward(self, x):
        return gem(x, p=self.p, eps=self.eps)

    def __repr__(self):
        return self.__class__.__name__ + '(' + 'p=' + '{:.4f}'.format(self.p.data.tolist()[0]) + ', ' + 'eps=' + str(self.eps) + ')'

Collect embeddings (CLS + GeM Pooled Output Patch Tokens)

all_image_features = []
gem_pooling = GeM()  # instance of the GeM module defined above (assumed)
with torch.no_grad():
    for imgb in progress_bar(image_dl):
        outputs = model.forward_output_patch_tokens_cls(imgb.cuda())
        cls_token, output_patch_tokens = outputs[:,0],outputs[:,1:]

        cls_features   = cls_token   
        patch_features = gem_pooling(output_patch_tokens.permute(0,2,1)).squeeze(-1)
        concat_features = torch.cat([cls_features,patch_features],dim=-1)
        all_image_features.append(concat_features.cpu())

Following this and using an image size of 224 for dino_vitb8, my final embedding dimension is ~~1568~~ 1536, which can also be calculated as:

cls_feature_dim*2 = 768*2

Question: Also, during the copy detection task, do you learn the pooling parameter p or is it picked based on a validation set? I didn't quite understand the whitening part; is it the same as regular unsupervised PCA?

Found this paper: https://hal.inria.fr/hal-00722622v2/document. I believe the idea comes from there.
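
For reference, here is a rough sketch of plain unsupervised PCA whitening fitted on a held-out set of descriptors, which is the generic recipe this question is about; the function names and shapes are my own assumptions, not the paper's exact pipeline:

    import torch

    def fit_pca_whitening(feats, eps=1e-6):
        "feats: num_samples x dim descriptors from a held-out set"
        cov = torch.cov(feats.T)                           # dim x dim covariance
        eigvals, eigvecs = torch.linalg.eigh(cov)
        # project onto the eigenvectors and rescale each direction to unit variance
        whiten = eigvecs / torch.sqrt(eigvals + eps)
        return feats.mean(dim=0), whiten

    def apply_whitening(feats, mean, whiten):
        out = (feats - mean) @ whiten
        return torch.nn.functional.normalize(out, dim=-1)  # l2-normalize for retrieval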

Edit:

Figured out the 1536 dimension size. We need to pool across the token positions, so this gives a pooled embedding with the same dimension as the CLS token embedding.
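
As a quick check of that shape arithmetic for dino_vitb8 at 224x224, using the gem function defined above (the shapes here are assumptions matching ViT-B with patch size 8):

    import torch

    outputs = torch.randn(1, 1 + 784, 768)                    # dummy CLS + 28x28 patch tokens, 768-d
    cls_token, patch_tokens = outputs[:, 0], outputs[:, 1:]
    pooled = gem(patch_tokens.permute(0, 2, 1)).squeeze(-1)   # GeM over the 784 token positions -> (1, 768)
    features = torch.cat([cls_token, pooled], dim=-1)
    print(features.shape)                                     # torch.Size([1, 1536])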

mathildecaron31 commented 3 years ago

Hi @KeremTurgutlu, let me open a new issue :)

@enverfakhan I have incorporated your suggested fix for the floating-point error and have also been trying to improve the forward logic in the vision_transformer.py code. Thanks a lot for your suggestion; feedback is appreciated if you have some time :). https://github.com/facebookresearch/dino/blob/6687929d7cdc2e7a5150f6e24c2b6713293944ac/vision_transformer.py#L174-L233

I'm closing this issue. Feel free to reopen if there are other problems related to the interpolation of the positional encodings.

NightMachinery commented 1 year ago

@mathildecaron31

This operation actually happens during training. Indeed, with multi-crop the model is trained both on 224x224 images and on 96x96 images, so when forwarding a batch of 96x96 images we need to interpolate the encodings. As a matter of fact, in my experiments I also tried having two sets of encodings; in practice that means I was using different encodings for the 224x224 inputs and for the 96x96 inputs. This solution has exactly the same performance as performing bicubic interpolation, which makes me think that the interpolation solution makes sense.

patch_pos_embed = nn.functional.interpolate(
    patch_pos_embed.reshape(1, int(math.sqrt(N)), int(math.sqrt(N)), dim).permute(
        0, 3, 1, 2
    ),
    scale_factor=(w0 / math.sqrt(N), h0 / math.sqrt(N)),
    mode="bicubic",
)

So when a 96x96 cropped image is fed to the model during training, the positional embeddings of the original 224x224 model get scaled down to match the 96x96 crop? (assuming a patch size of 8, the 28x28 grid of positional embeddings gets downsized to 12x12)

Couldn't we just use a subset of the original embeddings for the smaller images? Like in text models, we can give the model smaller sequences with no problems. Again assuming a patch size of 8, this would mean taking the first 12x12 subset of the whole 28x28 grid of positional embeddings.

I guess the current interpolation regime will make the model more invariant to the scale of the images...
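
Mechanically, the two options being discussed look like this (assuming a 28x28 grid for 224x224 inputs with patch size 8 and a 12x12 target for the 96x96 crops; the repo does the former, the slicing is only the hypothetical alternative above):

    import torch
    import torch.nn as nn

    dim = 384
    grid = torch.randn(1, dim, 28, 28)         # learned positional grid for 224x224, patch size 8

    # Option 1 (what the code does): bicubically resize the whole grid to 12x12.
    interpolated = nn.functional.interpolate(grid, size=(12, 12), mode="bicubic")

    # Option 2 (the subset idea): keep the top-left 12x12 corner of the grid.
    sliced = grid[:, :, :12, :12]

    print(interpolated.shape, sliced.shape)    # both torch.Size([1, 384, 12, 12])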

alexaatm commented 10 months ago

Hi! I stumbled on the same issue when using DINOv2; the code crashed in the same function when given rectangular input...

In the function that encodes positions, this GitHub issue was referenced:

def interpolate_pos_encoding(self, x, w, h):
    previous_dtype = x.dtype
    npatch = x.shape[1] - 1
    N = self.pos_embed.shape[1] - 1
    if npatch == N and w == h:
        return self.pos_embed
    pos_embed = self.pos_embed.float()
    class_pos_embed = pos_embed[:, 0]
    patch_pos_embed = pos_embed[:, 1:]
    dim = x.shape[-1]
    w0 = w // self.patch_size
    h0 = h // self.patch_size
    print(f'DEBUG dinov2 vision_trasnformer.py: w0={w0}, h0={h0}')
    # we add a small number to avoid floating point error in the interpolation
    # see discussion at https://github.com/facebookresearch/dino/issues/8
    w0, h0 = w0 + self.interpolate_offset, h0 + self.interpolate_offset
    print(f'DEBUG dinov2 vision_trasnformer.py: add small number w0={w0}, h0={h0}')

    sqrt_N = math.sqrt(N)
    sx, sy = float(w0) / sqrt_N, float(h0) / sqrt_N
    patch_pos_embed = nn.functional.interpolate(
        patch_pos_embed.reshape(1, int(sqrt_N), int(sqrt_N), dim).permute(0, 3, 1, 2),
        scale_factor=(sx, sy),
        mode="bicubic",
        antialias=self.interpolate_antialias,
    )
    print(f'DEBUG dinov2 vision_trasnformer.py: patch_pos_embed.shape={patch_pos_embed.shape}')

    assert int(w0) == patch_pos_embed.shape[-2]
    assert int(h0) == patch_pos_embed.shape[-1]
    patch_pos_embed = patch_pos_embed.permute(0, 2, 3, 1).view(1, -1, dim)
    return torch.cat((class_pos_embed.unsqueeze(0), patch_pos_embed), dim=1).to(previous_dtype)

I checked the output by feeding a rectangular image and found that the small addition did not change w0 and h0; see the output:

 image_crop.shape=torch.Size([1, 3, 434, 546])
DEBUG dinov2 vision_trasnformer.py: w0=31, h0=39
DEBUG dinov2 vision_trasnformer.py: add small number w0=31.0, h0=39.0
DEBUG dinov2 vision_trasnformer.py: patch_pos_embed.shape=torch.Size([1, 384, 31, 38])

Note that in the init, interpolate_offset=0.1. Here are the errors I got:

File "home/.cache/torch/hub/facebookresearch_dinov2_main/dinov2/models/vision_transformer.py", line 204, in interpolate_pos_encoding
    assert int(w0) == patch_pos_embed.shape[-2]
AssertionError

Note: I used the pretrained dinov2_vits14_reg model.

steve-landers commented 9 months ago

Is there a reason the interpolate call doesn't set the output size directly to (w0,h0) using the size parameter, rather than using the scale_factor parameter?

zshn25 commented 5 months ago

Is there a reason the interpolate call doesn't set the output size directly to (w0,h0) using the size parameter, rather than using the scale_factor parameter?

+1

Why can't we just do the following?

patch_pos_embed = nn.functional.interpolate(
            patch_pos_embed.reshape(1, int(sqrt_N), int(sqrt_N), dim).permute(0, 3, 1, 2),
            size=(int(w0), int(h0)),  # set the target size directly instead of passing a scale factor
            mode="bicubic",
            antialias=self.interpolate_antialias,
        )