OpenGVLab / VideoMAEv2

[CVPR 2023] VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking
https://arxiv.org/abs/2303.16727
MIT License

Finetuning with more than 16 frames #58

Open CSLR-research opened 4 months ago

CSLR-research commented 4 months ago

I pretrained a vit_small_patch16_224 model and want to finetune it with more than 16 frames. With 32 frames, loading the checkpoint fails with this error:

pos_tokens = pos_tokens.reshape(-1, T, P, P, C)
RuntimeError: shape '[-1, 8, 19, 19, 384]' is invalid for input of size 1204224
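
The numbers in the error message can be checked directly. Below is a minimal sketch of that arithmetic. It makes two assumptions that are not confirmed by the traceback: that the spatial size P is obtained as a truncated square root of (tokens / T), and that the patch grid is 14 x 14 (224 input, patch size 16). The variable names are illustrative only.

```python
import math

numel = 1_204_224   # input size reported by the RuntimeError
C = 384             # embedding dim: last entry of the requested shape
T, P = 8, 19        # temporal and spatial sizes from the requested shape

# Number of position tokens actually stored in the tensor.
tokens = numel // C
assert tokens * C == numel

# reshape(-1, T, P, P, C) only works if tokens is a multiple of T * P * P.
reshape_ok = tokens % (T * P * P) == 0

# Assumption: P came from int(sqrt(tokens / T)), which truncates 19.79... to 19.
p_guess = int(math.sqrt(tokens / T))

# Assumption: 14 x 14 patch grid (224 // 16). Temporal length that fits exactly.
grid = 224 // 16
t_fit = tokens // (grid * grid)
assert t_fit * grid * grid == tokens

print(tokens, reshape_ok, p_guess, t_fit)  # 3136 False 19 16
```

Under these assumptions the tensor holds 3136 tokens, which factors exactly as 16 x 14 x 14 rather than 8 x 19 x 19. That suggests the temporal length used in the reshape (8) does not match the temporal length of the tensor being reshaped (16), so the spatial size is inferred incorrectly.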