OpenGVLab / VideoMAEv2

[CVPR 2023] VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking
https://arxiv.org/abs/2303.16727
MIT License

Turning VideoMAEv2 into a next-frame prediction model #40

Open IoSonoMarco opened 11 months ago

IoSonoMarco commented 11 months ago

Great work and thanks for the code!

I was wondering whether, with a suitable masking strategy, one could do full next-frame prediction on an unseen video. This should apply to both VideoMAE V2 and VideoMAE, I guess. The masking strategy could simply be to mask the entire last frame, given a set of unmasked earlier frames, and then obtain logits for the reconstructed masked frame. Do you think this is feasible? Concretely, I imagine the mask would look something like the sketch below.
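
A minimal sketch of such a mask, assuming VideoMAE-style tokenization (16 frames, tubelet size 2, 16x16 patches on 224x224 input); the shapes and names here are illustrative, not code from this repo:

```python
import torch

# Sketch: build a boolean token mask that hides only the final temporal
# slice of tokens, so the model must reconstruct the last frame(s) from
# the visible earlier frames. Assumed VideoMAE-style tokenization:
# 16 frames, tubelet size 2, 224x224 crops with 16x16 spatial patches
# -> 8 temporal slices x (14 x 14) = 1568 tokens.
num_frames, tubelet_size = 16, 2
patches_per_frame = (224 // 16) * (224 // 16)  # 196 spatial patches
temporal_slices = num_frames // tubelet_size   # 8 token slices

# True = masked (to be reconstructed), False = visible.
mask = torch.zeros(temporal_slices, patches_per_frame, dtype=torch.bool)
mask[-1] = True        # mask every token in the last temporal slice
mask = mask.flatten()  # (1568,) token-level mask

# Caveat: with tubelet_size = 2, the last token slice spans the final
# TWO frames, so a strict single-frame target would need tubelet_size = 1
# (and a correspondingly retrained patch embedding).
```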

congee524 commented 6 months ago

I've done similar experiments and achieved results comparable to MAE. Based on my limited experimental results: predicting features is easier to train than predicting pixels; the potential of this training approach may be higher than MAE's; and the resource overhead may be greater. There has been some similar (predictive or autoregressive) work recently, such as V-JEPA and AIM. You could look into those for more details.
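
To make the feature-vs-pixel distinction concrete, here is a rough sketch of the two objectives; all tensors and names are placeholders, not code from this repo:

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch contrasting the two training targets mentioned above.
# `pred` is the decoder output on masked tokens, shape (B, N, D);
# `mask` is a boolean (B, N) tensor with True on masked positions.

def pixel_reconstruction_loss(pred, pixel_target, mask):
    # MAE-style: regress (per-patch normalized) raw pixels on masked tokens.
    loss = (pred - pixel_target) ** 2
    return loss.mean(dim=-1)[mask].mean()

def feature_prediction_loss(pred, feat_target, mask):
    # Feature-prediction style (as in V-JEPA-like setups): regress features
    # from a frozen or EMA teacher encoder instead of pixels.
    return F.smooth_l1_loss(pred[mask], feat_target[mask])
```

The extra resource overhead mentioned above comes mainly from the teacher forward pass needed to produce `feat_target`.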