facebookresearch / LaViLa

Code release for "Learning Video Representations from Large Language Models"
MIT License

Checkpoints may be loaded repeatedly. #32

Closed: yyvhang closed this issue 5 months ago

yyvhang commented 8 months ago

Hi, when I run the demo code, I notice that the vision model first loads 'openai_clip_ViT-L-14-336px.pt' inside the function 'VCLM_OPENAI_TIMESFORMER_LARGE_336PX_GPT2_XL'. But in 'demo_narrator.py', those parameters are then overwritten by the checkpoint downloaded from the given URL, 'vclm_openai xxx.pth'.

So at inference time, is it unnecessary to load the CLIP ViT-L parameters in 'VCLM_OPENAI_TIMESFORMER_LARGE_336PX_GPT2_XL'?
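For context, the sequence described above amounts to two loads, sketched below. The import path, constructor call, and checkpoint handling are approximations based on the issue text and the demo script's usual pattern, not exact repo code; 'vclm_openai xxx.pth' stands in for the actual checkpoint filename.

```python
import torch
from lavila.models.models import VCLM_OPENAI_TIMESFORMER_LARGE_336PX_GPT2_XL

# Load 1: the builder internally initializes the vision tower from the
# pretrained 'openai_clip_ViT-L-14-336px.pt' weights.
model = VCLM_OPENAI_TIMESFORMER_LARGE_336PX_GPT2_XL()  # ctor kwargs omitted

# Load 2: demo_narrator.py then loads the released VCLM checkpoint, whose
# state dict also covers the vision tower, overwriting load 1.
ckpt = torch.load('vclm_openai xxx.pth', map_location='cpu')
state_dict = {k.replace('module.', ''): v
              for k, v in ckpt['state_dict'].items()}  # strip DDP prefix
model.load_state_dict(state_dict, strict=True)
```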

zhaoyue-zephyrus commented 6 months ago

Hi @yyvhang ,

Good catch. Feel free to get rid of the CLIP weights if you are loading the VCLM checkpoint.

Best, Yue
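Continuing the sketch above, a quick sanity check can confirm that dropping the CLIP weights is safe: if the VCLM checkpoint covers every parameter in the model, the initial CLIP load is fully overwritten anyway. This assumes the 'state_dict' key and 'module.' prefix from the earlier sketch.

```python
# `model` and `ckpt` as in the sketch above.
ckpt_keys = {k.replace('module.', '') for k in ckpt['state_dict']}
model_keys = set(model.state_dict())
missing = sorted(model_keys - ckpt_keys)
# An empty result means every parameter is overwritten by the VCLM
# checkpoint, so the CLIP initialization can be skipped entirely.
print('parameters not covered by the VCLM checkpoint:', missing or 'none')
```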