Open deepsurbhi8 opened 1 year ago
Hi @deepsurbhi8 ,
It is conceptually possible to extract features with a number of frames other than 4. The only difference is that you need to linearly interpolate the positional embedding, and we have provided a function to do this. Please also take a look at this fine-tuning example to see how we load a model pre-trained with 4 frames and fine-tune it on a downstream dataset with a different number of frames.
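For illustration, here is a minimal sketch of what linearly interpolating a temporal positional embedding can look like. It is not LaViLa's own function: the name `interpolate_temporal_embed`, the `(frames, dim)` layout, and the endpoint-aligned resampling are all assumptions made for this example, and the random array only stands in for real pretrained weights.

```python
import numpy as np

def interpolate_temporal_embed(embed, new_frames):
    """Linearly resample a (old_frames, dim) temporal embedding to new_frames rows.

    Hypothetical helper for illustration; the layout and endpoint alignment
    are assumptions, not taken from the LaViLa code base.
    """
    embed = np.asarray(embed, dtype=np.float64)
    old_frames, dim = embed.shape
    if new_frames == old_frames:
        return embed.copy()
    # Place the old and new frame indices on the same [0, 1] time axis.
    old_pos = np.linspace(0.0, 1.0, old_frames)
    new_pos = np.linspace(0.0, 1.0, new_frames)
    # Interpolate each embedding channel independently along time.
    columns = [np.interp(new_pos, old_pos, embed[:, d]) for d in range(dim)]
    return np.stack(columns, axis=1)

# Toy stand-in for a pretrained 4-frame temporal embedding (not real weights).
pretrained = np.random.default_rng(0).normal(size=(4, 8))
resized = interpolate_temporal_embed(pretrained, 16)
print(resized.shape)  # (16, 8)
```

With this alignment the first and last rows of the resized embedding equal the first and last rows of the original, and the rows in between are weighted averages of neighbouring frames. A real checkpoint may also carry spatial embeddings and a class token, which this sketch does not touch.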
Hope this helps.
Best, Yue
Thanks for the reply! I have one more question: I need to use the "CLIP_OPENAI_TIMESFORMER_LARGE" model, but where can I find its pretrained weights?
Hi @deepsurbhi8 ,
You might want to take a look at https://github.com/facebookresearch/LaViLa/blob/main/docs/MODEL_ZOO.md#narrator (the second column from the right).
Best, Yue
I need to extract only visual features using LaViLa's pretrained weights, but the model takes only 4 frames into consideration. Is there any way I can extract features using more frames, such as 16 or 32? Which script should I run? Please reply!