facebookresearch / SlowFast

PySlowFast: video understanding codebase from FAIR for reproducing state-of-the-art video models.
Apache License 2.0

Is it necessary to retrain the MVIT model when modifying the PATCH_KERNEL parameter during inference? #694

Open Balakishan77 opened 7 months ago

Balakishan77 commented 7 months ago

Hello,

How much does changing the PATCH_KERNEL size at MViT inference time affect the need to retrain the model, considering factors such as the magnitude of the change and downstream task performance?

For a given change in PATCH_KERNEL size (e.g., a 10% increase), can MViT still produce accurate results without retraining, or is retraining essential to maintain performance?

alpargun commented 7 months ago

The MViT implementation uses the PatchEmbed class to "patchify" a 2D image or 3D video with an embedding projection. For this purpose, a 2D or 3D convolution is applied, as shown below:

[Screenshot: the PatchEmbed class definition from the SlowFast repository, showing the 2D/3D convolution used for the embedding projection]
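For reference, here is a minimal sketch of what that patch-embedding module looks like. The argument names and default values are approximations (based on the MViT-B Kinetics config: PATCH_KERNEL [3, 7, 7], PATCH_STRIDE [2, 4, 4], PATCH_PADDING [1, 3, 3]) and may differ slightly from the actual repository code:

```python
import torch
import torch.nn as nn


class PatchEmbed(nn.Module):
    """Approximate sketch of MViT's patch embedding: a single 2D or 3D
    convolution projects raw pixels into token embeddings. The config
    options MVIT.PATCH_KERNEL / PATCH_STRIDE / PATCH_PADDING map to the
    kernel, stride, and padding arguments here."""

    def __init__(
        self,
        dim_in=3,
        dim_out=96,
        kernel=(3, 7, 7),    # MVIT.PATCH_KERNEL
        stride=(2, 4, 4),    # MVIT.PATCH_STRIDE
        padding=(1, 3, 3),   # MVIT.PATCH_PADDING
        conv_2d=False,       # Conv2d for images, Conv3d for video
    ):
        super().__init__()
        conv = nn.Conv2d if conv_2d else nn.Conv3d
        # The learned weight tensor has shape (dim_out, dim_in, *kernel),
        # i.e. it depends directly on the kernel size.
        self.proj = conv(
            dim_in, dim_out, kernel_size=kernel, stride=stride, padding=padding
        )

    def forward(self, x):
        x = self.proj(x)                     # B, C, T', H', W'
        return x.flatten(2).transpose(1, 2)  # B, T'*H'*W', C (token sequence)
```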

If you do not retrain with the modified PATCH_KERNEL size and try to run inference directly, I would expect this to fail: the patch-embedding convolution's weight shape depends on the kernel size, so the configured model architecture will no longer match the downloaded checkpoint and the pretrained weights cannot be loaded.
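A quick way to see the mismatch is to compare the patch-embedding weight shapes for two kernel sizes and try to load one into the other. This continues the hypothetical PatchEmbed sketch above rather than the repo's exact class:

```python
# Patch embedding as trained (checkpoint saved with the original kernel).
trained = PatchEmbed(kernel=(3, 7, 7), padding=(1, 3, 3))
state = trained.state_dict()
print(state["proj.weight"].shape)    # torch.Size([96, 3, 3, 7, 7])

# Same module re-configured with a slightly larger PATCH_KERNEL.
modified = PatchEmbed(kernel=(3, 8, 8), padding=(1, 3, 3))
print(modified.proj.weight.shape)    # torch.Size([96, 3, 3, 8, 8])

try:
    # Strict loading fails because the conv weight shapes differ.
    modified.load_state_dict(state)
except RuntimeError as e:
    print("Checkpoint/architecture mismatch:", e)
```

So, in practice, changing PATCH_KERNEL means the pretrained patch-embedding weights no longer fit, and you would need to retrain (or at least re-initialize and fine-tune that layer) rather than reuse the downloaded checkpoint as-is.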