Hi, thanks for sharing the code! While I was testing it out with pretrained models on ImageNet21k, I encountered an error:
RuntimeError: Error(s) in loading state_dict for VisionTransformer:
size mismatch for cls_token: copying a param with shape torch.Size([1, 1, 192]) from checkpoint, the shape in current model is torch.Size([1, 1, 768]).
size mismatch for pos_embed: copying a param with shape torch.Size([1, 196, 192]) from checkpoint, the shape in current model is torch.Size([1, 196, 768]).
size mismatch for patch_embed.proj.weight: copying a param with shape torch.Size([192, 3, 16, 16]) from checkpoint, the shape in current model is torch.Size([768, 3, 16, 16]).
.... (more size mismatch messages...)
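For context, this is the standard error PyTorch raises whenever checkpoint parameter shapes don't match the freshly built model. A minimal sketch (not the ConViT code, just two mismatched `nn.Linear` layers standing in for the 192- vs 768-wide models):

```python
import torch.nn as nn

# Stand-ins for the tiny checkpoint (width 192) and the mis-sized model (width 768).
checkpoint_model = nn.Linear(3, 192)
current_model = nn.Linear(3, 768)

try:
    # strict loading (the default) raises on any shape mismatch
    current_model.load_state_dict(checkpoint_model.state_dict())
except RuntimeError as err:
    print("size mismatch" in str(err))  # prints: True
```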
Looks like the same error encountered by issue #4 (which was addressed by PR #6). But I'm facing the error on all the pretrained models (tiny, small, base). I think I might have found the issue:
On `convit.py#L305`:
`embed_dim *= num_heads` was added as part of #6.
But on `models.py#20`:
`kwargs['embed_dim'] *= num_heads` already multiplies the `embed_dim` by `num_heads` (the same operation for all models).
So `convit.py#L305` looks redundant to me, and it's what causes the size mismatch above. When I remove that line, the test works as expected.
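For concreteness, here's a small arithmetic sketch of the double scaling. The per-head width of 48 and `num_heads` of 4 are my inference from the 192/768 shapes in the error message, not values read from the repo:

```python
# Hypothetical ConViT-Ti numbers inferred from the error message:
# checkpoint width 192, broken model width 768 = 192 * 4.
head_dim, num_heads = 48, 4

# models.py#20 scales the width once -- this matches the checkpoint:
embed_dim = head_dim
embed_dim *= num_heads
print(embed_dim)  # 192, the checkpoint's cls_token/pos_embed width

# convit.py#L305 then scales it a second time -- producing the mismatch:
embed_dim *= num_heads
print(embed_dim)  # 768, the width the broken model actually builds
```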