OpenGVLab / VideoMamba

VideoMamba: State Space Model for Efficient Video Understanding
https://arxiv.org/abs/2403.06977
Apache License 2.0

How to apply a pre-trained VideoMamba to videos of different temporal resolutions? #29

Closed · makecent closed this issue 2 months ago

makecent commented 2 months ago

I read in the paper that the VideoMamba model pre-trained on K400 has been applied to long-term video datasets such as Breakfast, COIN, and LVU. I'm working on a similar application where I need to adapt the pre-trained VideoMamba to downstream videos with different temporal resolutions.

Could you elaborate on the techniques used to adapt the model from short-term datasets like K400 to long-term datasets such as LVU, especially the handling of the temporal positional encoding? For example, is interpolation used to manage these differences?

Thanks for your help!

makecent commented 2 months ago

From this released checkpoint, it seems that you simply re-trained VideoMamba on K400 with long-term input rather than interpolating the positional embeddings to fit variable input resolutions. For reference, a sketch of what such interpolation could look like is below.
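
Here is a minimal sketch, assuming the checkpoint stores the temporal positional embedding as a separate tensor of shape (1, T, C); the function name and shapes are illustrative, not the repo's actual API:

```python
import torch
import torch.nn.functional as F

def interpolate_temporal_pos_embed(pos_embed: torch.Tensor, num_frames_new: int) -> torch.Tensor:
    # pos_embed: (1, T_old, C) temporal positional embedding from the checkpoint
    # (assumed layout; the spatial embedding would be left untouched)
    pos = pos_embed.permute(0, 2, 1)                      # -> (1, C, T_old) for 1D interpolation
    pos = F.interpolate(pos, size=num_frames_new,
                        mode="linear", align_corners=False)
    return pos.permute(0, 2, 1)                           # -> (1, T_new, C)

# e.g. stretch an 8-frame embedding to 32 frames before loading the state dict
old = torch.zeros(1, 8, 576)   # 576 = embed_dim of VideoMamba-M (illustrative)
new = interpolate_temporal_pos_embed(old, 32)
print(new.shape)               # torch.Size([1, 32, 576])
```

Whether the interpolated embedding stays close enough to the pre-training distribution is exactly the open question; re-training with long-term input, as the checkpoint suggests you did, sidesteps it.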

franklio commented 2 months ago

Can you explain your folder structure? I changed DATA_PATH and PREFIX to the files downloaded here and ran `bash ./exp/breakfast/videomamba_middle_mask/run_f32x224.sh`, but it didn't work. Is there anything specific I should pay attention to? Thank you for your help!