facebookresearch / LaViLa

Code release for "Learning Video Representations from Large Language Models"
MIT License

LaViLa as feature extractor #22

Open · deepsurbhi8 opened 1 year ago

deepsurbhi8 commented 1 year ago

I need to extract only visual features using LaViLa's pretrained weights, but the model takes only 4 frames as input. Is there any way I can extract features using more frames, such as 16 or 32? Which script should I run? Please reply!

zhaoyue-zephyrus commented 1 year ago

Hi @deepsurbhi8 ,

It is possible to extract features with a number of frames other than 4. The only difference is that you need to linearly interpolate the positional embedding so that it matches the new number of frames, and we've provided a function to do this. Also, please take a look at this fine-tuning example to see how we load a model pre-trained with 4 frames and fine-tune it on a downstream dataset with a different number of frames.
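To illustrate the interpolation step, here is a minimal sketch, not the function from the LaViLa repository. It assumes the temporal positional embedding is stored separately as a `(T, D)` array (one row per frame), and the name `interpolate_temporal_embed` is hypothetical. It uses NumPy and keeps the first and last frame embeddings fixed, which may differ from the interpolation mode the repository uses.

```python
import numpy as np


def interpolate_temporal_embed(embed: np.ndarray, new_num_frames: int) -> np.ndarray:
    """Linearly resample a (T_old, D) temporal positional embedding to (T_new, D).

    Hypothetical helper for illustration; assumes one embedding row per frame.
    """
    old_num_frames, dim = embed.shape
    if new_num_frames == old_num_frames:
        return embed.copy()
    # Put the old and new frame positions on the same normalized [0, 1] axis,
    # so the first and last frames keep their original embeddings.
    old_pos = np.linspace(0.0, 1.0, old_num_frames)
    new_pos = np.linspace(0.0, 1.0, new_num_frames)
    # Interpolate every embedding channel independently along the time axis.
    columns = [np.interp(new_pos, old_pos, embed[:, d]) for d in range(dim)]
    return np.stack(columns, axis=1)


# Example: stretch a 4-frame embedding (random stand-in values) to 16 frames.
rng = np.random.default_rng(0)
temporal_embed_4 = rng.standard_normal((4, 768))
temporal_embed_16 = interpolate_temporal_embed(temporal_embed_4, 16)
print(temporal_embed_16.shape)  # (16, 768)
```

In practice the resized array would replace the 4-frame entry in the checkpoint's state dict before loading it into a model built for 16 frames. The key name and tensor layout depend on the model definition, so check them against the actual checkpoint.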

Hope this helps.

Best, Yue

deepsurbhi8 commented 1 year ago

Thanks for the reply! I have one more question: I need to use the "CLIP_OPENAI_TIMESFORMER_LARGE" model. Where can I find its pretrained weights?

zhaoyue-zephyrus commented 1 year ago

Hi @deepsurbhi8 ,

You might want to take a look at https://github.com/facebookresearch/LaViLa/blob/main/docs/MODEL_ZOO.md#narrator (the second column from the right).

Best, Yue