Thank you for your work! I have a question about the zero-shot video retrieval task on the ActivityNet dataset: which pretrained model should I use to reproduce the reported performance? Is it CLIP ViT-L-14.pt? Thank you for your response!
Apologies for the delayed response. In InternVideo1, we use CLIP-ViT for pretraining, whereas in InternVideo2, we train the vision model from scratch.