OpenGVLab / unmasked_teacher

[ICCV2023 Oral] Unmasked Teacher: Towards Training-Efficient Video Foundation Models
https://arxiv.org/abs/2303.16058
MIT License

Some questions about pretraining #43

Open Chuan-shanjia opened 4 months ago

Chuan-shanjia commented 4 months ago

Hello! I'm very interested in your great work! I have two questions about pretraining.

1. Does the generalization ability of UMT come from CLIP? If so, then regardless of which pre-training dataset is used, the goal is essentially to approach the effectiveness of the open-source CLIP weights. Is the choice of pre-training dataset in stage 1 therefore important?
2. Is the stage-2 pre-training helpful for visual-only tasks? If we fine-tune a visual-only dataset on the stage-2 pretrained model, will it outperform the stage-1 pretrained model?

Looking forward to your reply!

Andy1621 commented 4 months ago
  1. High-quality videos will be better: I have used WebVid, which is ~10x larger than K400, with 1/10 of the epochs, but the result is worse. That's why I only use videos from action recognition datasets; see InternVideo2.
  2. Good question! Under a full-tuning setting, stage 2's checkpoint performs similarly to stage 1's checkpoint. But under a frozen-tuning setting, the multi-modal training helps and performs much better (see the sketch below for what frozen-tuning means here).
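For reference, here is a minimal sketch (not the repo's actual code) of the two settings above: in frozen-tuning the pretrained backbone is kept fixed and only the new head is trained, so the result depends much more on the pretrained representation. The `VideoClassifier` wrapper and `feat_dim` argument are hypothetical names for illustration.

```python
# Minimal sketch of full-tuning vs. frozen-tuning for a visual-only task.
import torch.nn as nn

class VideoClassifier(nn.Module):
    """Pretrained video backbone plus a linear head for a visual-only task."""
    def __init__(self, backbone: nn.Module, feat_dim: int, num_classes: int,
                 freeze_backbone: bool = False):
        super().__init__()
        self.backbone = backbone
        self.head = nn.Linear(feat_dim, num_classes)
        if freeze_backbone:
            # Frozen-tuning: only the head receives gradients, so the result
            # depends almost entirely on the pretrained representation.
            for p in self.backbone.parameters():
                p.requires_grad = False

    def forward(self, video):
        feats = self.backbone(video)  # assumed to return [B, feat_dim] features
        return self.head(feats)
```

With `freeze_backbone=True`, you would also pass only the trainable parameters to the optimizer, e.g. `filter(lambda p: p.requires_grad, model.parameters())`.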
Chuan-shanjia commented 4 months ago

Your answer is really helpful, thank you! If I want to apply the model to other video domains rather than action recognition, will it be helpful to continue pretraining (stage 1) on those videos? Or do you have any suggestions for improving performance in other video domains? Looking forward to your reply!

Andy1621 commented 4 months ago

Sorry for the late response. You can take the models and continue masked pretraining on those videos.
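As a rough illustration of what continuing stage-1 masked distillation on new-domain videos could look like, here is a minimal sketch. It operates on token-level features and uses hypothetical `student`, `teacher`, and `proj` modules; the actual masking strategy, architectures, and loss in the paper/repo differ in detail.

```python
# Rough sketch: continue stage-1 masked distillation on new-domain videos.
# `student`, `teacher`, and `proj` are hypothetical modules for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

def masked_distill_step(student: nn.Module, teacher: nn.Module, proj: nn.Module,
                        video_tokens: torch.Tensor, mask_ratio: float = 0.8):
    """One step: drop most video tokens, align the kept student tokens with the
    frozen CLIP teacher's features at the same positions."""
    B, N, D = video_tokens.shape
    num_keep = int(N * (1 - mask_ratio))
    # Randomly choose which token positions the student keeps.
    keep_idx = torch.rand(B, N, device=video_tokens.device).argsort(dim=1)[:, :num_keep]

    with torch.no_grad():  # the CLIP teacher stays frozen
        teacher_feats = teacher(video_tokens)  # [B, N, D_t]
    target = torch.gather(
        teacher_feats, 1,
        keep_idx.unsqueeze(-1).expand(-1, -1, teacher_feats.size(-1)))

    kept = torch.gather(
        video_tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D))
    pred = proj(student(kept))  # project student features to the teacher's dim

    # Align normalized features on the unmasked tokens only.
    return F.mse_loss(F.normalize(pred, dim=-1), F.normalize(target, dim=-1))
```

In practice you would initialize `student` from a released UMT checkpoint, keep the teacher frozen, and run a step like this over your domain videos with the stage-1 training recipe.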