I read in the paper that the VideoMamba model pre-trained on K400 has been applied to long-term video datasets such as Breakfast, COIN, and LVU. I'm working on a similar application where I need to adapt the pre-trained VideoMamba to downstream videos with different temporal resolutions.
Could you elaborate on the techniques used to adapt the model from short-term datasets like K400 to long-term datasets such as LVU, especially the handling of the temporal positional encoding? For example, is interpolation used to manage these differences?
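To make the question concrete, the sketch below is the kind of interpolation I have in mind. It is only a sketch: the attribute name `temporal_pos_embedding`, the `(1, T, C)` shape, and the embedding dimension are my assumptions, not taken from your code.

```python
import torch
import torch.nn.functional as F

def interpolate_temporal_pos_embed(pos_embed: torch.Tensor, new_len: int) -> torch.Tensor:
    """Resize a (1, T_old, C) temporal positional embedding to (1, new_len, C)."""
    # F.interpolate expects (N, C, L), so move the temporal axis last
    pos = pos_embed.permute(0, 2, 1)
    pos = F.interpolate(pos, size=new_len, mode="linear", align_corners=False)
    return pos.permute(0, 2, 1)

# Example: stretch an 8-frame short-term embedding to 32 frames for long-term input
old_embed = torch.randn(1, 8, 576)  # 576 = embed dim, assumed for illustration
new_embed = interpolate_temporal_pos_embed(old_embed, 32)
print(new_embed.shape)  # torch.Size([1, 32, 576])
```

Is something along these lines what you do at fine-tuning time, or is the temporal embedding learned from scratch for each input length?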
From this released checkpoint, it seems that you simply re-train VideoMamba on K400 with long-term input rather than interpolating the positional embeddings to fit variable input resolutions.
Can you explain your folder structure? I changed DATA_PATH and PREFIX to the files downloaded here and ran bash ./exp/breakfast/videomamba_middle_mask/run_f32x224.sh, but it didn't work. Is there anything specific I should pay attention to? Thank you for your help!
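For reference, this is how I currently have the files arranged. The annotation-file names and the per-line format are my guesses, so please correct whatever doesn't match what the loader expects:

```
breakfast_anno/                 # what I point DATA_PATH at
├── train.csv                   # guessed format: relative/path/video.mp4,label
├── val.csv
└── test.csv
breakfast_videos/               # what I point PREFIX at
└── relative/path/video.mp4     # resolved as PREFIX + path from the csv
```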