model zero-shot retrieval capability of the videochat2 stage-1 model

OpenGVLab / Ask-Anything

[CVPR2024 Highlight][VideoChatGPT] ChatGPT with video understanding! And many more supported LMs such as miniGPT4, StableLM, and MOSS.

https://vchat.opengvlab.com/

MIT License

3.05k stars, 252 forks

model zero-shot retrieval capability of the videochat2 stage-1 model #212

Closed wengzejia1 closed 3 weeks ago

wengzejia1 commented 3 months ago

Hello. Could you provide the evaluation results for the videochat2 stage-1 model, especially its zero-shot retrieval performance on the MSR-VTT dataset? Should it perform better than the UMT model or not? Thanks.

wengzejia1 commented 3 months ago

Also, the current code seems to have some problems in stage-1 evaluation. I modified the code to run the evaluation on your released stage-1 model checkpoint, but the result is strange: the VTM results are worse than the VTC results. Can you help me verify this? Or could you release the stage-1 evaluation? Thank you.
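For anyone comparing VTC and VTM numbers, here is a minimal sketch of how text-to-video recall@K is commonly computed from a score matrix. It assumes one ground-truth video per caption, stored at the same index; the function name `recall_at_k` and the toy scores are illustrative and are not taken from the Ask-Anything code.

```python
def recall_at_k(scores, k):
    """Text-to-video recall@K.

    scores[i][j] is the similarity of caption i to video j.
    Assumption: the ground-truth video for caption i is video i.
    """
    hits = 0
    for i, row in enumerate(scores):
        # Rank video indices by descending score for this caption.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(scores)


# Toy 3x3 score matrix (made-up numbers, not real model output).
toy_scores = [
    [0.9, 0.1, 0.3],  # caption 0: correct video ranked first
    [0.2, 0.4, 0.8],  # caption 1: correct video ranked second
    [0.1, 0.2, 0.7],  # caption 2: correct video ranked first
]
r1 = recall_at_k(toy_scores, 1)
r2 = recall_at_k(toy_scores, 2)
```

Running the same function on the VTC score matrix and on the VTM score matrix keeps the two metrics directly comparable.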

Andy1621 commented 3 months ago

Hi! You may refer to BLIP2 for help. As far as I remember, the stage-1 model does not work better than UMT.

wengzejia1 commented 3 months ago

When I resume from your released stage-1 model and continue the stage-1 training process, the VTM results become better than the VTC results, whereas my test of the released stage-1 model checkpoint itself shows VTM results that are worse than the VTC results. I would appreciate it if you could update the stage-1 evaluation code and provide the stage-1 zero-shot retrieval results for your released stage-1 model.
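To make the VTC/VTM comparison concrete, below is a rough sketch of one common evaluation scheme, in which the matching head only rescores the top-k candidates selected by the contrastive scores. Whether the stage-1 evaluation here follows exactly this scheme is an assumption; `rerank_with_vtm` and the `vtm_score` callable are hypothetical names used only for illustration.

```python
def rerank_with_vtm(vtc_scores, vtm_score, k):
    """Rescore the top-k VTC candidates of each caption with a matching head.

    vtc_scores[i][j]: contrastive similarity of caption i and video j.
    vtm_score(i, j): hypothetical callable returning the matching score.
    Candidates outside the top-k get -inf so they never outrank a
    reranked candidate.
    """
    reranked = []
    for i, row in enumerate(vtc_scores):
        # Select the k best videos for this caption by contrastive score.
        topk = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        new_row = [float("-inf")] * len(row)
        for j in topk:
            new_row[j] = vtm_score(i, j)
        reranked.append(new_row)
    return reranked


# Toy example with made-up scores and a stand-in matching function.
toy_vtc = [
    [0.9, 0.8, 0.1],
    [0.3, 0.5, 0.4],
]
toy_vtm = rerank_with_vtm(toy_vtc, lambda i, j: 1.0 if i == j else 0.0, k=2)
```

Under this scheme the VTM ranking is restricted to the candidates that VTC already selected, so a VTM score below the VTC score may point to the matching head or its inputs rather than to candidate selection.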

Andy1621 commented 3 months ago

Hi! It may be difficult to release the stage-1 evaluated results, since it was done by another intern who has quit. 😭

wengzejia1 commented 3 months ago

Also, the code that loads the pretrained UMT model seems to have a problem caused by the name prefix "vision_encoder.". The parameter names of the UMT vision encoder in umt-l16 contain the "vision_encoder" prefix, while the parameters of the ViT model in the code do not contain that prefix. This mismatch causes the pretrained model loading to fail, which in turn makes it impossible to reproduce stage-1.

I would appreciate it if you could check whether this bug exists. Thank you so much.
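A possible workaround for this kind of key mismatch, sketched under the assumption that the only difference is the leading "vision_encoder." prefix: rename the checkpoint keys before loading. The helper name `strip_prefix` and the key names in the toy checkpoint are made up for illustration and are not copied from umt-l16.

```python
def strip_prefix(state_dict, prefix="vision_encoder."):
    """Return a copy of state_dict with `prefix` removed from matching keys.

    Keys that do not start with the prefix are kept unchanged.
    """
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in state_dict.items()
    }


# Toy checkpoint with made-up key names and placeholder values.
checkpoint = {
    "vision_encoder.patch_embed.proj.weight": "w0",
    "vision_encoder.blocks.0.attn.qkv.weight": "w1",
    "text_encoder.embeddings.weight": "w2",
}
remapped = strip_prefix(checkpoint)
```

With PyTorch, the remapped dictionary can then be passed to `load_state_dict` with `strict=False`, and the returned `missing_keys` and `unexpected_keys` show whether any parameters were still skipped.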

wengzejia1 commented 3 months ago

> Hi! It may be difficult to release the stage-1 evaluated results, since it was done by another intern who has quit. 😭

Could you tell me the name of the author who did the first stage training? Maybe I can email him for consultation. 😬

Andy1621 commented 3 months ago

Yizhuo Li conducted the experiment~

yinanhe commented 3 weeks ago

Hi, we will close this issue.

Feel free to contact us if you have other questions.