OpenGVLab / InternVideo

[ECCV2024] Video Foundation Models & Data for Multimodal Understanding
Apache License 2.0

Training and Evaluation Code for ViClip #131

Open fmthoker opened 4 months ago

fmthoker commented 4 months ago

Dear authors, great work, and thanks for releasing the code for ViCLIP pretraining on InternVid-10M-FLT. Firstly, it would be really great if the pre-training instructions were more detailed, e.g. which CLIP model to start from, paths for configs, etc. Secondly, can you please also release the evaluation code and scripts for evaluating pretrained ViCLIP models on zero-shot Kinetics-400, SSv2, UCF, etc.? I want to reproduce the zero-shot evaluation numbers in my local setup.

Thanks and Regards

Andy1621 commented 4 months ago

Hi! For the zero-shot evaluation, you can refer to the VideoCLIP in InternVideo2.

fmthoker commented 4 months ago

@Andy1621 Thanks for the quick response. Are you referring to the scripts in InternVideo/InternVideo2/multi_modality/scripts/evaluation/clip/zero_shot? If so, they seem to be for evaluating the InternVideo2 CLIP. Would the scripts and code work off-the-shelf for the ViCLIP models you have shared? Do we need to make any changes? It would also be great if you could share the eval code for ViCLIP directly. Thanks in advance.

Andy1621 commented 4 months ago

Hi~ You can find the evaluation scripts here

fmthoker commented 4 months ago

@Andy1621 Thanks for your quick response, I will try that to reproduce the results.

fmthoker commented 4 months ago

@Andy1621 I tried to do zero-shot eval on MSRVTT-1k with the scripts from here. However, I am getting the following error:

Traceback (most recent call last):
  File "tasks/retrieval.py", line 15, in <module>
    from models.vindlu import VindLU
ModuleNotFoundError: No module named 'models.vindlu'

Andy1621 commented 4 months ago

I think it's a bug introduced when cleaning the code; you can fix it in tasks/retrieval.py by

# from models.vindlu import VindLU
# from models.vindlu_vit import VindLU_VIT
# from models.vindlu_videoclip import VindLU_VideoCLIP
# from models.vindlu_blip_qformer import VindLU_BLIP_QFormer
from models.viclip import ViCLIP

And also change the model in config.py from VindLU_VideoCLIP to ViCLIP.
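The config change above can be sketched like this (the exact structure of config.py is an assumption based on this thread, not verified against the repo):

```python
# Hypothetical excerpt of the ViCLIP eval config: swap the model class
# from the VindLU wrapper to ViCLIP so tasks/retrieval.py builds the right model.
model = dict(
    model_cls="ViCLIP",  # previously "VindLU_VideoCLIP"
)
print(model["model_cls"])
```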

fmthoker commented 4 months ago

@Andy1621 Thanks, that solves the problem. However, I think the code is still not complete, as I get the following error:

Traceback (most recent call last):
  File "tasks/retrieval.py", line 292, in <module>
    main(cfg)
  File "tasks/retrieval.py", line 208, in main
    res = evaluation_wrapper(
  File "/ibex/ai/home/thokerfm/anaconda3/envs/viclip/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/thokerfm/InternVideo/InternVideo1/Pretrain/ViCLIP/tasks/retrieval_utils.py", line 85, in evaluation_wrapper
    i2t_x, t2i_x, i2t_emb, t2i_emb = evaluation(
  File "/ibex/ai/home/thokerfm/anaconda3/envs/viclip/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/thokerfm/InternVideo/InternVideo1/Pretrain/ViCLIP/tasks/retrieval_utils.py", line 132, in evaluation
    image_feats, pooled_image_feats = extract_vision_feats(
  File "/home/thokerfm/InternVideo/InternVideo1/Pretrain/ViCLIP/tasks/retrieval_utils.py", line 54, in extract_vision_feats
    image_feat, pooled_image_feat = model.encode_vision(image, test=True)
ValueError: too many values to unpack (expected 2)
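For reference, this ValueError means the model's encode_vision returns more values than the two the eval code unpacks. A minimal reproduction with a stand-in function (not the real model):

```python
# Stand-in for a model whose encode_vision returns three values
# while the caller unpacks only two -- the mismatch raises ValueError.
def encode_vision(image, test=False):
    return "image_feat", "pooled_feat", "extra_output"

def unpack_two():
    try:
        image_feat, pooled_image_feat = encode_vision(None, test=True)
        return None
    except ValueError as e:
        return str(e)

print(unpack_two())  # too many values to unpack (expected 2)
```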

Code-kunkun commented 3 months ago

> @Andy1621 Thanks, it solves the problem, however I think the code is still not complete as I get the following error:
> `ValueError: too many values to unpack (expected 2)` (same traceback as above)

Did you solve this problem? I got the same error.

fmthoker commented 3 months ago

@Code-kunkun Yes, you need to change line 79 in tasks/retrieval_utils.py (https://github.com/OpenGVLab/InternVideo/blob/10183826112bd7edd983b68b6d7a5faa5d370709/InternVideo1/Pretrain/ViCLIP/tasks/retrieval_utils.py#L79) to

if config.model.model_cls == "VindLU_VideoCLIP" or config.model.model_cls == "ViCLIP"

Let me know if that works.
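The changed condition can also be written as a membership test, which is a sketch of the same fix (the surrounding code in retrieval_utils.py is assumed from the traceback, not copied from the repo):

```python
# Hypothetical guard mirroring the suggested fix: treat ViCLIP the same
# as VindLU_VideoCLIP when choosing the evaluation path.
def needs_clip_style_eval(model_cls):
    return model_cls in ("VindLU_VideoCLIP", "ViCLIP")

print(needs_clip_style_eval("ViCLIP"))      # True
print(needs_clip_style_eval("VindLU_VIT"))  # False
```

Using `in` over a tuple instead of chained `or` comparisons keeps the condition short if more model classes need the same path later.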

Code-kunkun commented 3 months ago

> @Code-kunkun Yes, you need to change line 79 in tasks/retrieval_utils.py
> https://github.com/OpenGVLab/InternVideo/blob/10183826112bd7edd983b68b6d7a5faa5d370709/InternVideo1/Pretrain/ViCLIP/tasks/retrieval_utils.py#L79
> to `if config.model.model_cls == "VindLU_VideoCLIP" or config.model.model_cls == "ViCLIP"`. Let me know if that works.

Thanks for your quick reply! It works🥳.

fmthoker commented 3 months ago

@Andy1621 Thanks for your help so far with the zero-shot evaluation. Can you please point me to the scripts/code to use for full fine-tuning of the ViCLIP models? Also, how do we run full fine-tuning for action classification datasets like SSv2 and Kinetics with the current codebase?