OpenGVLab / VideoMAEv2

[CVPR 2023] VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking
https://arxiv.org/abs/2303.16727
MIT License
445 stars 45 forks

Unable to load the distilled model weights provided in the model zoo #24

Closed. druefena closed this issue 1 year ago

druefena commented 1 year ago

How can one load and use the pre-trained distilled models from the model zoo?

First, I create the model as follows (I had to comment out all non-default parameters because they are not recognized):

from timm.models import create_model  # assumed import: timm's model factory

model = create_model(
    'vit_base_patch16_224',
    img_size=224,
    pretrained=False,
    num_classes=710,
    # all_frames=args.num_frames * args.num_segments,
    # tubelet_size=args.tubelet_size,
    # drop_rate=args.drop,
    # drop_path_rate=args.drop_path,
    # attn_drop_rate=args.attn_drop_rate,
    # head_drop_rate=args.head_drop_rate,
    # drop_block_rate=None,
    # use_mean_pooling=args.use_mean_pooling,
    # init_scale=args.init_scale,
    # with_cp=args.with_checkpoint,
)

When I try to load the weights from https://pjlab-gvm-data.oss-cn-shanghai.aliyuncs.com/internvideo/distill/vit_s_k710_dl_from_giant.pth using the utils.load_state_dict() function, I get multiple errors, including:

size mismatch for patch_embed.proj.weight: copying a param with shape torch.Size([768, 3, 2, 16, 16]) from checkpoint, the shape in current model is torch.Size([768, 3, 16, 16]).
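To see every mismatched parameter at once rather than reading the errors one by one, the shapes in the checkpoint can be compared against the shapes in the model. The following is only a minimal sketch: find_shape_mismatches is a hypothetical helper (not part of the repository or of PyTorch), it works on plain name-to-shape dictionaries, and the parameter name is written as patch_embed.proj.weight on the assumption that this is the name reported in the error.

```python
from typing import Dict, List, Tuple

Shape = Tuple[int, ...]


def find_shape_mismatches(
    checkpoint_shapes: Dict[str, Shape],
    model_shapes: Dict[str, Shape],
) -> List[Tuple[str, Shape, Shape]]:
    """Hypothetical helper: list parameters present in both dicts whose shapes differ.

    Returns (name, checkpoint_shape, model_shape) tuples in checkpoint order.
    """
    mismatches = []
    for name, ckpt_shape in checkpoint_shapes.items():
        model_shape = model_shapes.get(name)
        # Keys missing from the model are skipped; only true shape conflicts are reported.
        if model_shape is not None and model_shape != ckpt_shape:
            mismatches.append((name, ckpt_shape, model_shape))
    return mismatches


# Shapes copied from the error message quoted above.
checkpoint_shapes = {"patch_embed.proj.weight": (768, 3, 2, 16, 16)}
model_shapes = {"patch_embed.proj.weight": (768, 3, 16, 16)}

for name, ckpt_shape, model_shape in find_shape_mismatches(checkpoint_shapes, model_shapes):
    print(f"{name}: checkpoint {ckpt_shape} vs model {model_shape}")
```

With PyTorch, the two dictionaries could be built from a state dict with a comprehension such as {k: tuple(v.shape) for k, v in state_dict.items()}, assuming the checkpoint's weights have already been extracted into a flat state dict.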

I assume this might be because the tubelet size is missing, which by default is set to 2 (and could be the dimension I am missing). So I guess the main question is, how to load the model (and which model)?
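The extra dimension is consistent with that guess. PyTorch convolution weights are laid out as (out_channels, in_channels, *kernel_size), so a 3D patch embedding with a temporal kernel of 2 carries one more axis than a 2D one. The sketch below only does the shape arithmetic; patch_embed_weight_shape is a hypothetical helper written for illustration, and the values 768, 3, and 16 are taken from the error message.

```python
from typing import Optional, Tuple


def patch_embed_weight_shape(
    embed_dim: int,
    in_chans: int,
    patch_size: int,
    tubelet_size: Optional[int] = None,
) -> Tuple[int, ...]:
    """Hypothetical helper: expected conv weight shape for a patch embedding.

    Follows the PyTorch layout (out_channels, in_channels, *kernel_size).
    tubelet_size=None models a 2D (image) patch embedding; an integer
    models a 3D (video) embedding with that temporal kernel size.
    """
    spatial = (patch_size, patch_size)
    kernel = spatial if tubelet_size is None else (tubelet_size, *spatial)
    return (embed_dim, in_chans, *kernel)


# 2D embedding: matches the shape reported for the current model.
print(patch_embed_weight_shape(768, 3, 16))                  # (768, 3, 16, 16)
# 3D embedding with tubelet size 2: matches the checkpoint shape.
print(patch_embed_weight_shape(768, 3, 16, tubelet_size=2))  # (768, 3, 2, 16, 16)
```

This suggests the model being built uses a 2D patch embedding while the checkpoint was saved from a model with a 3D one, which would also explain why the video-specific arguments are not recognized.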

Any help appreciated, thanks!