Closed: dfan closed this issue 2 years ago
The paper mentions that frames are sampled with a temporal stride of 2 for SSv2 and the pretraining script sets the sampling rate to 2. But for finetuning, it seems the temporal stride is set to 4. Is this intentional or a mistake?
Pretrain script: https://github.com/MCG-NJU/VideoMAE/blob/main/scripts/ssv2/videomae_vit_base_patch16_224_tubemasking_ratio_0.9_epoch_800/pretrain.sh#L18
Finetune script: https://github.com/MCG-NJU/VideoMAE/blob/main/scripts/ssv2/videomae_vit_base_patch16_224_tubemasking_ratio_0.9_epoch_800/finetune.sh#L25
Default sampling rate of 4: https://github.com/MCG-NJU/VideoMAE/blob/main/run_class_finetuning.py#L145
Sorry, I just realized SSv2 uses TSN-style sampling, so there is no notion of a sampling rate.
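For anyone else who trips over this, here is a rough sketch of the difference between the two sampling schemes. It is not the repository's dataset code: `dense_indices` and `tsn_indices` are hypothetical helper names, and the exact choice of frame within each segment is an assumption. The point is only that segment-based sampling takes a frame count and no stride, so the effective spacing depends on the video length.

```python
import random


def dense_indices(num_video_frames, num_frames=16, sampling_rate=2, start=0):
    """Fixed-stride clip: start, start + rate, ... clamped to the last frame."""
    last = num_video_frames - 1
    return [min(start + i * sampling_rate, last) for i in range(num_frames)]


def tsn_indices(num_video_frames, num_segments=16, train=False, rng=None):
    """TSN-style: split the video into equal segments, take one frame from each.

    Assumption for illustration: a random frame per segment when train=True,
    the segment's middle frame otherwise.
    """
    rng = rng or random
    seg_len = num_video_frames / num_segments
    indices = []
    for i in range(num_segments):
        lo = int(i * seg_len)
        hi = max(lo, int((i + 1) * seg_len) - 1)
        idx = rng.randint(lo, hi) if train else (lo + hi) // 2
        indices.append(min(idx, num_video_frames - 1))
    return indices


if __name__ == "__main__":
    # The stride is explicit for dense sampling...
    print(dense_indices(64, num_frames=16, sampling_rate=2))
    # ...but only implied by the video length for TSN-style sampling.
    print(tsn_indices(64, num_segments=16))
    print(tsn_indices(128, num_segments=16))
```

With 16 segments, a 64-frame video ends up with frames 4 apart and a 128-frame video with frames 8 apart, so a fixed sampling-rate argument has nothing to control in this mode.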