facebookresearch / convit

Code for the Convolutional Vision Transformer (ConViT)
Apache License 2.0

Size mismatch on testing the model with pretrained weights. #10

Closed jdubpark closed 3 years ago

jdubpark commented 3 years ago

Hi, thanks for sharing the code! While I was testing it out with pretrained models on ImageNet21k, I encountered an error:

RuntimeError: Error(s) in loading state_dict for VisionTransformer:
        size mismatch for cls_token: copying a param with shape torch.Size([1, 1, 192]) from checkpoint, the shape in current model is torch.Size([1, 1, 768]).
        size mismatch for pos_embed: copying a param with shape torch.Size([1, 196, 192]) from checkpoint, the shape in current model is torch.Size([1, 196, 768]).
        size mismatch for patch_embed.proj.weight: copying a param with shape torch.Size([192, 3, 16, 16]) from checkpoint, the shape in current model is torch.Size([768, 3, 16, 16]).
.... (more size mismatch messages...)

Looks like the same error as in issue #4 (which was addressed by PR #6). But I'm seeing the error on all the pretrained models (tiny, small, base). I think I might have found the cause:

On convit.py#L305: embed_dim *= num_heads was added as part of #6.

But on models.py#20: kwargs['embed_dim'] *= num_heads already multiplies the embed_dim by num_heads (the same operation, for all models).

So it looks to me like convit.py#L305 is redundant, and it's what causes the size mismatch above: the embedding dimension gets multiplied by num_heads twice. When I remove that line, the test works as expected.
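To illustrate, here is a minimal sketch of the double scaling. The helper name and the embed_dim=48 / num_heads=4 values are my assumptions (convit_tiny-like), chosen to be consistent with the 192 vs. 768 mismatch in the traceback (192 * 4 = 768):

```python
def scaled_embed_dim(embed_dim, num_heads, redundant_scale=False):
    """Hypothetical sketch of how the kwargs are transformed before model creation."""
    kwargs = {'embed_dim': embed_dim, 'num_heads': num_heads}
    # models.py#20: expand the per-head dimension to the full embedding dim
    kwargs['embed_dim'] *= num_heads
    if redundant_scale:
        # convit.py#L305: the same multiplication applied a second time
        kwargs['embed_dim'] *= num_heads
    return kwargs['embed_dim']

print(scaled_embed_dim(48, 4))                        # 192, matches the checkpoint
print(scaled_embed_dim(48, 4, redundant_scale=True))  # 768, triggers the size mismatch
```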

sdascoli commented 3 years ago

Indeed, thanks a lot for your PR!