Closed betterze closed 3 months ago
Dear VideoMamba team,

Thank you for sharing this great work; I really enjoyed it.

If I understand correctly, the model zoo contains different models trained on different datasets. Is there a foundation model that was trained on all of these video datasets? We would like to use a single general model to extract video features.

Thank you for your help.

Best Wishes,
Zongze

Thanks for your question. In my opinion, the right foundation model depends on your downstream task: if the task is single-modality, the K400 pretraining should suffice; if it is multi-modality, use the 25M pretraining.

The current VideoMamba models are small, so I don't expect them to reach SOTA performance against larger-scale models. At the same model size, however, they should perform better on most tasks, as shown in our recent work.

Got it, thanks a lot.
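For what it's worth, the rule of thumb in the reply (single-modality → K400, multi-modality → 25M) can be captured as a tiny helper. This is only an illustrative sketch of the advice in this thread; `choose_pretraining` and the returned corpus names are placeholders, not identifiers from the VideoMamba repo.

```python
def choose_pretraining(modalities):
    """Suggest a VideoMamba pretraining corpus per the rule of thumb above.

    modalities: a set of input modalities, e.g. {"video"} or {"video", "text"}.
    Returns the name of the suggested pretraining corpus.
    """
    if len(modalities) <= 1:
        # Single-modality tasks (e.g. action recognition): K400 pretraining.
        return "K400"
    # Multi-modality tasks (e.g. video-text retrieval): 25M pretraining.
    return "25M"
```

For example, `choose_pretraining({"video"})` suggests the K400 checkpoint, while `choose_pretraining({"video", "text"})` suggests the 25M one.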