Open Balakishan77 opened 7 months ago
The MViT implementation uses the PatchEmbed class to "patchify" a 2D image or 3D video and project it into the embedding dimension. For this purpose, a 2D or 3D convolution operation is used, as shown below:
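The referenced snippet is not included above; the following is a minimal sketch of what such a patch-embedding layer looks like (class and parameter names are illustrative, not the exact PySlowFast implementation — the default 3D kernel/stride/padding values below are assumptions):

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Minimal sketch: "patchify" a video with a strided 3D conv and
    project each patch to the embedding dimension."""

    def __init__(self, dim_in=3, dim_out=96,
                 kernel=(3, 7, 7), stride=(2, 4, 4), padding=(1, 3, 3)):
        super().__init__()
        # The conv both extracts overlapping patches (via stride) and
        # performs the embedding projection (via dim_out output channels).
        self.proj = nn.Conv3d(dim_in, dim_out, kernel_size=kernel,
                              stride=stride, padding=padding)

    def forward(self, x):
        # (B, C, T, H, W) -> (B, dim_out, T', H', W')
        x = self.proj(x)
        # Flatten the spatiotemporal grid into a token sequence:
        # (B, dim_out, T'*H'*W') -> (B, T'*H'*W', dim_out)
        return x.flatten(2).transpose(1, 2)

video = torch.randn(1, 3, 8, 56, 56)
tokens = PatchEmbed()(video)
print(tokens.shape)  # torch.Size([1, 784, 96])
```

For images the same idea applies with `nn.Conv2d` and a 2D kernel/stride. Note that the conv weight tensor has shape `(dim_out, dim_in, *kernel)`, which is what ties the checkpoint to a specific PATCH_KERNEL.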
If you do not retrain after modifying the PATCH_KERNEL size and directly run inference, I would expect this convolution to fail: the projection weights' shape depends on the kernel size, so the configured model architecture will no longer match the downloaded pretrained checkpoint.
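The mismatch is easy to demonstrate: a conv layer's weight shape includes the kernel size, so a checkpoint saved with one kernel cannot be loaded into a layer configured with another (the specific kernel values below are illustrative):

```python
import torch
import torch.nn as nn

# Layer as it was pretrained, and the same layer with a modified kernel.
pretrained = nn.Conv3d(3, 96, kernel_size=(3, 7, 7), stride=(2, 4, 4))
modified = nn.Conv3d(3, 96, kernel_size=(3, 8, 8), stride=(2, 4, 4))

# The kernel size is baked into the weight tensor's shape.
print(pretrained.weight.shape)  # torch.Size([96, 3, 3, 7, 7])
print(modified.weight.shape)    # torch.Size([96, 3, 3, 8, 8])

# Loading the pretrained weights into the modified layer therefore fails.
try:
    modified.load_state_dict(pretrained.state_dict())
    loaded = True
except RuntimeError as e:
    loaded = False
    print("load failed: size mismatch" if "size mismatch" in str(e)
          else "load failed")
```

So the failure happens at checkpoint-loading time, before any forward pass runs.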
Hello,
To what extent does modifying the PATCH_KERNEL size for MViT inference create a need to retrain the model, considering factors like the magnitude of the change and downstream task performance?
For a given change in PATCH_KERNEL size (e.g., a 10% increase), can MViT still produce accurate results without retraining, or is retraining essential to maintain performance?