fmthoker opened this issue 5 months ago
Hi! For the zero-shot evaluation, you can refer to the VideoCLIP in InternVideo2.
@Andy1621 Thanks for the quick response. Are you referring to the scripts in InternVideo/InternVideo2/multi_modality/scripts/evaluation/clip/zero_shot? If so, it seems they are for evaluating the InternVideo2 CLIP. Would the scripts and code work off-the-shelf for the ViCLIP models that you have shared, or do we need to make changes? It would also be great if you could share the eval code for ViCLIP directly. Thanks in advance.
@Andy1621 Thanks for your quick response, I will try that to reproduce the results.
@Andy1621 I tried to do zero-shot eval on msrvtt-1k with the scripts from here.
However, I am getting the following error:
File "tasks/retrieval.py", line 15, in <module>
I think it's a bug introduced when cleaning the code. You can fix it in tasks/retrieval.py by changing the imports to:
# from models.vindlu import VindLU
# from models.vindlu_vit import VindLU_VIT
# from models.vindlu_videoclip import VindLU_VideoCLIP
# from models.vindlu_blip_qformer import VindLU_BLIP_QFormer
from models.viclip import ViCLIP
And also change the model in config.py from `VindLU_VideoCLIP` to `ViCLIP`.
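For reference, the change in config.py would look roughly like this (a sketch only; the actual config file has more fields, and its exact layout may differ):

```python
# config.py (sketch): point the evaluation at the ViCLIP model class
model = dict(
    model_cls="ViCLIP",  # was "VindLU_VideoCLIP"
    # ... keep the remaining model settings from the original config unchanged ...
)
```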
@Andy1621 Thanks, that solves the problem; however, I think the code is still not complete, as I get the following error:

Traceback (most recent call last):
  File "tasks/retrieval.py", line 292, in <module>
    main(cfg)
  File "tasks/retrieval.py", line 208, in main
    res = evaluation_wrapper(
  File "/ibex/ai/home/thokerfm/anaconda3/envs/viclip/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/thokerfm/InternVideo/InternVideo1/Pretrain/ViCLIP/tasks/retrieval_utils.py", line 85, in evaluation_wrapper
    i2t_x, t2i_x, i2t_emb, t2i_emb = evaluation(
  File "/ibex/ai/home/thokerfm/anaconda3/envs/viclip/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/thokerfm/InternVideo/InternVideo1/Pretrain/ViCLIP/tasks/retrieval_utils.py", line 132, in evaluation
    image_feats, pooled_image_feats = extract_vision_feats(
  File "/home/thokerfm/InternVideo/InternVideo1/Pretrain/ViCLIP/tasks/retrieval_utils.py", line 54, in extract_vision_feats
    image_feat, pooled_image_feat = model.encode_vision(image, test=True)
ValueError: too many values to unpack (expected 2)
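The `ValueError: too many values to unpack (expected 2)` suggests that ViCLIP's `encode_vision` returns something other than the two values this code path expects. A quick, throwaway way to confirm that (a hypothetical helper, using only the names visible in the traceback above) would be:

```python
# Temporary debugging helper (not part of the repo): call it right before the
# failing unpack in extract_vision_feats to see what encode_vision actually returns.
def inspect_encode_vision(model, image):
    out = model.encode_vision(image, test=True)
    if isinstance(out, (tuple, list)):
        print(f"encode_vision returned {len(out)} values")
    else:
        print(f"encode_vision returned a single {type(out).__name__}")
    return out
```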
Did you solve this problem? I got the same error.
@Code-kunkun Yes, you need to change line 79 in tasks/retrieval_utils.py (https://github.com/OpenGVLab/InternVideo/blob/10183826112bd7edd983b68b6d7a5faa5d370709/InternVideo1/Pretrain/ViCLIP/tasks/retrieval_utils.py#L79) to `if config.model.model_cls == "VindLU_VideoCLIP" or config.model.model_cls == "ViCLIP"`. Let me know if that works.
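In context, the change would look roughly like this (a sketch; whatever body already sits under that condition in the repo stays as it is):

```python
# tasks/retrieval_utils.py, around line 79 (sketch of the suggested change)
if config.model.model_cls == "VindLU_VideoCLIP" or config.model.model_cls == "ViCLIP":
    ...  # existing VideoCLIP branch, unchanged
```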
Thanks for your quick reply! It works🥳.
@Andy1621 Thanks for your help so far with the zero-shot evaluation. Can you please point me to the scripts/code to use for full fine-tuning of the ViCLIP models? Also, how do we run full fine-tuning on action classification datasets like SSv2 and Kinetics with the current codebase?
Dear authors, great work and thanks for releasing the code for ViCLIP pretraining on InternVid-10M-FLT. Firstly, it would be really great if the pre-training instructions were more detailed, e.g., which CLIP models to start from, paths for configs, etc. Secondly, can you please also release the evaluation code and scripts for evaluating pretrained ViCLIP models on zero-shot Kinetics-400, SSv2, UCF, etc.? I want to reproduce the zero-shot evaluation numbers in my local setup.
Thanks and Regards