OpenGVLab / VideoMAEv2

[CVPR 2023] VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking
https://arxiv.org/abs/2303.16727
MIT License

Hello! Can you release the base (non-distilled) ViT-Base model, or is only the ViT-Base distillation model available? #10

Closed DragonWang-cell closed 1 year ago

congee524 commented 1 year ago

Hi, for ViT variants of the same size we don't want to offer too many options, which might confuse our users.

You could perhaps use InternVideo's model instead. It was pre-trained for 800 epochs on the same UnlabeledHybrid dataset without dual masking, and its performance is close to that of our VideoMAE V2-B, which was pre-trained for 1200 epochs with dual masking.
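For context on the dual masking mentioned above: the encoder sees only a small visible subset of video tokens, and the decoder reconstructs only a subset of the remaining masked tokens, reducing compute on both sides. The sketch below is illustrative only (the function name, ratios, and the use of uniform random selection for the decoder mask are assumptions; the paper's decoder actually uses running cell masking, not random selection):

```python
import numpy as np

def dual_mask(num_tokens, enc_mask_ratio=0.9, dec_mask_ratio=0.5, seed=0):
    """Illustrative dual-masking index split (not the official implementation).

    Returns the token indices the encoder processes and the token indices
    the decoder is asked to reconstruct.
    """
    rng = np.random.default_rng(seed)
    perm = rng.permutation(num_tokens)

    # Encoder mask: keep only (1 - enc_mask_ratio) of tokens visible.
    num_visible = int(num_tokens * (1 - enc_mask_ratio))
    visible_idx = perm[:num_visible]    # fed to the encoder
    masked_idx = perm[num_visible:]     # hidden from the encoder

    # Decoder mask: reconstruct only a fraction of the masked tokens
    # (random here for simplicity; the paper uses running cell masking).
    num_decode = int(len(masked_idx) * dec_mask_ratio)
    decode_idx = masked_idx[:num_decode]
    return visible_idx, decode_idx

# Example: 1568 tokens, e.g. a 16-frame clip tokenized into 8 x 14 x 14 cubes.
vis, dec = dual_mask(1568)
```

With the default 90% encoder masking and 50% decoder masking, the encoder processes about 10% of the tokens and the decoder reconstructs about 45%, which is the source of the pre-training speedup.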