MCG-NJU / VideoMAE

[NeurIPS 2022 Spotlight] VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training
https://arxiv.org/abs/2203.12602

Difference in Temporal Stride between Pretraining and Finetuning on SSv2 #59

Closed dfan closed 2 years ago

dfan commented 2 years ago

The paper mentions that frames are sampled with a temporal stride of 2 for SSv2 and the pretraining script sets the sampling rate to 2. But for finetuning, it seems the temporal stride is set to 4. Is this intentional or a mistake?

Pretrain script (sampling rate 2): https://github.com/MCG-NJU/VideoMAE/blob/main/scripts/ssv2/videomae_vit_base_patch16_224_tubemasking_ratio_0.9_epoch_800/pretrain.sh#L18
Finetune script: https://github.com/MCG-NJU/VideoMAE/blob/main/scripts/ssv2/videomae_vit_base_patch16_224_tubemasking_ratio_0.9_epoch_800/finetune.sh#L25
Default sampling rate of 4: https://github.com/MCG-NJU/VideoMAE/blob/main/run_class_finetuning.py#L145

dfan commented 2 years ago

Sorry, I just realized that for SSv2 finetuning the loader does TSN-style sampling, so there is no notion of a fixed sampling rate: the clip is divided into equal segments and one frame is drawn from each, which makes the `sampling_rate` argument irrelevant there.
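For readers hitting the same confusion, the difference between the two sampling schemes can be sketched as follows. This is an illustrative simplification, not the repository's actual code; the function names and the choice of the segment center are assumptions.

```python
import numpy as np

def fixed_stride_indices(num_frames_total, num_frames, sampling_rate, start=0):
    """Fixed-stride sampling (as in pretraining): take num_frames frames
    spaced sampling_rate apart, clipped to the video length."""
    indices = start + np.arange(num_frames) * sampling_rate
    return np.clip(indices, 0, num_frames_total - 1)

def tsn_indices(num_frames_total, num_segments):
    """TSN-style sampling (as in SSv2 finetuning): split the video into
    num_segments equal segments and take the center frame of each, so the
    whole clip is covered regardless of its length and no stride is used."""
    seg_len = num_frames_total / num_segments
    return np.array([int(seg_len * i + seg_len / 2) for i in range(num_segments)])

# Example: a 100-frame video, 16 frames per clip.
# Fixed stride 4 covers only frames 0..60; TSN spreads 16 frames over all 100.
print(fixed_stride_indices(100, 16, 4))
print(tsn_indices(100, 16))
```

Because TSN-style sampling adapts the effective stride to the video length, the `--sampling_rate 4` default in the finetuning entry point simply has no effect on SSv2.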