OpenGVLab / VideoMamba

VideoMamba: State Space Model for Efficient Video Understanding
https://arxiv.org/abs/2403.06977
Apache License 2.0

foundation model #13

Closed · betterze closed this issue 3 months ago

betterze commented 3 months ago

Dear VideoMamba team,

Thank you for sharing this great work; I really enjoyed it.

If I understand correctly, the model zoo contains different models trained on different datasets. Is there a foundation model there that is trained on all kinds of video datasets? We want to use a general model to extract video features.

Thank you for your help.

Best Wishes,

Zongze

Andy1621 commented 3 months ago

Thanks for your question. In my opinion, the right foundation model depends on your downstream task. If your task is single-modality, you can simply use the K400-pretrained model; if it is multi-modal, you can simply use the 25M-pretrained model.

The current VideoMamba models are small, so I don't expect them to reach SOTA performance compared with larger-scale models. However, at the same model size, I think they achieve better performance on most tasks, as shown in our recent work.
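For feature extraction, a minimal sketch might look like the snippet below. Note that the builder name (`videomamba_middle`), the `num_classes=0` argument, the checkpoint filename, and the input shape are assumptions for illustration; please check `videomamba.py` and the model zoo in this repo for the actual names.

```python
# Minimal sketch (not the official API): extract clip-level features with a
# pretrained VideoMamba checkpoint. Builder name, checkpoint path, and input
# shape are placeholders -- adapt them to the real model zoo entries.
import torch
from videomamba import videomamba_middle  # hypothetical import path

device = "cuda" if torch.cuda.is_available() else "cpu"

# Build the backbone without a classification head so forward() returns features.
model = videomamba_middle(num_classes=0)  # hypothetical builder / argument
state = torch.load("videomamba_m16_k400.pth", map_location="cpu")  # hypothetical checkpoint
model.load_state_dict(state, strict=False)
model.eval().to(device)

# Dummy clip: (batch, channels, frames, height, width). Real inputs should be
# resized and normalized the same way as during pretraining.
clip = torch.randn(1, 3, 16, 224, 224, device=device)

with torch.no_grad():
    features = model(clip)  # e.g. a (1, embed_dim) clip-level embedding

print(features.shape)
```

The same pattern should apply to the 25M-pretrained checkpoint if you need multi-modal features; only the checkpoint file changes.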

betterze commented 2 months ago

Got it, thanks a lot.