OpenGVLab / unmasked_teacher

[ICCV2023 Oral] Unmasked Teacher: Towards Training-Efficient Video Foundation Models
https://arxiv.org/abs/2303.16058
MIT License

Images at different resolutions compared with training size #5

Closed: littlespray closed this issue 1 year ago

littlespray commented 1 year ago

Thank you for your nice work! Will it work well for inputs with a large difference in resolution compared to the training size, such as 386x1024?

Also, I am wondering if the positional embedding supports flexible image sizes. Could I fine-tune the same model on datasets with different resolutions?

Andy1621 commented 1 year ago

For MiTV1, I fine-tuned the models with 384x384 input. For larger input resolutions, the position embedding needs to be resized, since I adopt a static sin-cos position embedding; see https://github.com/OpenGVLab/unmasked_teacher/blob/496ad05ceb1be873e2f5b3d56bc7606b84104bda/single_modality/models/modeling_finetune.py#L171-L184
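
For readers who want a feel for what "resizing" a static sin-cos position embedding involves, here is a minimal pure-Python sketch. It is not the code at the link above. The function names `sinusoid_table` and `resize_grid`, the patch size of 16, the 14x14 pretraining grid, the embedding width of 8, and the corner-aligned bilinear interpolation are all assumptions made for illustration. The temporal axis of a video model is ignored. A real implementation would more likely call a library interpolation routine on tensors.

```python
import math

def sinusoid_table(n_position, d_hid):
    """Standard fixed sin-cos table: one d_hid-dim vector per position index."""
    table = []
    for pos in range(n_position):
        row = []
        for j in range(d_hid):
            angle = pos / (10000 ** (2 * (j // 2) / d_hid))
            # Even channels use sin, odd channels use cos.
            row.append(math.sin(angle) if j % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

def resize_grid(table, old_hw, new_hw):
    """Bilinearly resample a flattened (H*W, C) table to (H'*W', C).

    Hypothetical helper: corner-aligned interpolation is an assumption,
    chosen only because it is easy to verify by hand.
    """
    (old_h, old_w), (new_h, new_w) = old_hw, new_hw
    assert len(table) == old_h * old_w
    dim = len(table[0])
    out = []
    for i in range(new_h):
        # Map the new row index onto the old grid.
        y = i * (old_h - 1) / (new_h - 1) if new_h > 1 else 0.0
        y0 = int(math.floor(y))
        y1 = min(y0 + 1, old_h - 1)
        wy = y - y0
        for j in range(new_w):
            x = j * (old_w - 1) / (new_w - 1) if new_w > 1 else 0.0
            x0 = int(math.floor(x))
            x1 = min(x0 + 1, old_w - 1)
            wx = x - x0
            a = table[y0 * old_w + x0]
            b = table[y0 * old_w + x1]
            c = table[y1 * old_w + x0]
            d = table[y1 * old_w + x1]
            out.append([
                (1 - wy) * ((1 - wx) * a[k] + wx * b[k])
                + wy * ((1 - wx) * c[k] + wx * d[k])
                for k in range(dim)
            ])
    return out

# Assumed setup: 224x224 pretraining with 16x16 patches gives a 14x14 grid;
# 384x384 fine-tuning gives a 24x24 grid.
table = sinusoid_table(14 * 14, 8)
resized = resize_grid(table, (14, 14), (24, 24))
print(len(resized), len(resized[0]))  # 576 8
```

Because the table is fixed rather than learned, resizing it touches no trained weights. The same checkpoint can therefore be fine-tuned at another resolution as long as the table is resampled to the new patch grid. A non-square input such as the one asked about would use a non-square `new_hw`, provided the height and width divide evenly into patches.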