Closed wengzejia1 closed 3 weeks ago
Also, the current code seems to have some problems in stage-1 evaluation. I modified the code to run the evaluation process on your released stage-1 model checkpoint, but the result is strange: the VTM results are worse than the VTC results. Can you help me verify this? Or can you release the evaluation code for stage-1? Thank you.
Hi! You may refer to BLIP2 for help. As far as I remember, the stage-1 model does not work better than UMT.
When I resume from your released stage-1 model and continue the stage-1 training process, the VTM results become better than the VTC results, whereas my test on the released stage-1 checkpoint itself shows VTM results that are worse than the VTC results. I would appreciate it if you could update the stage-1 evaluation code and share the stage-1 zero-shot retrieval results for your released stage-1 model.
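To make the VTC/VTM comparison concrete, here is a minimal sketch of one common scheme (BLIP-style evaluation): rank all candidates by the contrastive (VTC) similarity, then re-score only the top-k candidates with the matching (VTM) head. It is an assumption that the stage-1 evaluation follows this scheme; the function names and the toy scores are hypothetical and not taken from the repository.

```python
def recall_at_k(scores, k=1):
    """scores[i][j] is the score of text i against video j; ground truth is j == i."""
    hits = 0
    for i, row in enumerate(scores):
        top = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        hits += i in top
    return hits / len(scores)


def rerank_with_vtm(vtc_scores, vtm_score_fn, top_k=2):
    """Keep the top_k VTC candidates per query and re-score only those with VTM."""
    reranked = []
    for i, row in enumerate(vtc_scores):
        candidates = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:top_k]
        new_row = [float("-inf")] * len(row)  # non-candidates can never win
        for j in candidates:
            new_row[j] = vtm_score_fn(i, j)
        reranked.append(new_row)
    return reranked


# Toy VTC similarities (hypothetical numbers): query 1 is ranked wrongly by VTC.
vtc = [[0.9, 0.8, 0.1],
       [0.7, 0.6, 0.2],
       [0.1, 0.3, 0.5]]

# Stand-in for a matching head; a real one would run the VTM classifier on the pair.
def toy_vtm(i, j):
    return 1.0 if i == j else 0.0

vtc_r1 = recall_at_k(vtc, k=1)
vtm_r1 = recall_at_k(rerank_with_vtm(vtc, toy_vtm, top_k=2), k=1)
print(vtc_r1, vtm_r1)
```

Under this scheme the VTM score can only improve on VTC if the matching head produces meaningful scores, so when VTM comes out worse it may be worth checking whether the matching-head weights were actually loaded from the checkpoint.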
Hi! It may be difficult to release the stage-1 evaluation results, since the evaluation was done by another intern who has since left.
Also, the code that loads the pretrained UMT model seems to have a problem caused by the name prefix "vision_encoder.". The parameter names of the UMT vision encoder in the umt-l16 checkpoint contain the "vision_encoder." prefix, while the parameter names of the ViT model in the code do not. As a result, the pretrained weights fail to load, and stage-1 cannot be reproduced.
I would appreciate it if you could check whether this bug exists. Thank you so much.
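The prefix mismatch described above can be checked without building the model. A minimal sketch, assuming the checkpoint is a plain mapping from parameter names to tensors; the helpers `strip_prefix` and `diff_keys` and the example key names are hypothetical and not taken from the repository:

```python
def strip_prefix(state_dict, prefix="vision_encoder."):
    """Return a copy of state_dict with `prefix` removed from keys that start with it."""
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in state_dict.items()
    }


def diff_keys(checkpoint_keys, model_keys):
    """Return (missing, unexpected): keys the model lacks in the checkpoint, and extras."""
    checkpoint_keys, model_keys = set(checkpoint_keys), set(model_keys)
    return sorted(model_keys - checkpoint_keys), sorted(checkpoint_keys - model_keys)


# Hypothetical key names, for illustration only; values stand in for tensors.
checkpoint = {
    "vision_encoder.patch_embed.weight": 0,
    "vision_encoder.blocks.0.attn.qkv.weight": 1,
}
model_keys = ["patch_embed.weight", "blocks.0.attn.qkv.weight"]

missing_before, unexpected_before = diff_keys(checkpoint, model_keys)
missing_after, unexpected_after = diff_keys(strip_prefix(checkpoint), model_keys)
print("before:", missing_before, unexpected_before)
print("after: ", missing_after, unexpected_after)
```

With PyTorch, `model.load_state_dict(state, strict=False)` returns `missing_keys` and `unexpected_keys`, which gives the same kind of report on the real checkpoint.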
Could you tell me the name of the author who did the first stage training? Maybe I can email him for consultation. 😬
Yizhuo Li conducted the experiment~
Hi, we will close this issue.
Feel free to contact us if you have other questions.
Hello. Can you provide the evaluation results (especially the zero-shot retrieval performance on the MSR-VTT dataset) for the videochat2 stage-1 model? Should it perform better than the UMT model or not? Thanks.