Closed: algorithmee closed this issue 2 months ago
Hi! It's hard to summarize the different costs here, but you can run the training code to measure the cost for each model. Training VideoMamba is about 1.5~2x slower than ViT for image understanding and short-term video understanding, though its inference speed is higher.
For VideoMamba-M, our largest model, the 300-epoch fine-tuning on ImageNet costs about 3 days on 16 A100s. The 50-epoch fine-tuning on K400 with 8 frames costs about 2 days on 16 A100s. The 200-epoch masked pretraining on K400 costs about 3 days on 32 A100s. As for the long-term video datasets, they only require a few hours since the datasets are small.
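The quoted schedules can be converted into total compute with simple arithmetic. A minimal sketch, using the approximate days and GPU counts stated above (these are rough figures from this thread, not exact measurements):

```python
# Rough A100-day arithmetic for the VideoMamba-M schedules quoted above.
# (days, gpus) pairs are the approximate figures from this thread.
schedules = {
    "imagenet_finetune_300ep": (3, 16),  # ~3 days on 16 A100s
    "k400_finetune_50ep":      (2, 16),  # ~2 days on 16 A100s
    "k400_masked_pretrain":    (3, 32),  # ~3 days on 32 A100s
}

for name, (days, gpus) in schedules.items():
    # Total compute = wall-clock days x number of GPUs
    print(f"{name}: ~{days * gpus} A100-days")
```

So, for example, the masked pretraining stage alone amounts to roughly 96 A100-days of compute.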
Thank you for your reply! How about the 10-epoch distillation pretraining on WebVid-2M and CC3M? (It also seems that you trained on a 25M dataset like UMT; how about that?) By the way, should "distill VideoMamba-M over 800 epochs" on page 10 of your paper actually read "200 epochs"?
Thanks for your correction. For K400 and SSv2, we indeed distill for 800 epochs (200 epochs was used for K710 in UMT). For the 5M data, it's much faster and only requires about 1 day.
Thank you for your reply! Did you do the 200-epoch training on K710 before the 5M dataset, as in UMT? If so, how long did it take? To sum up: you first conducted a 300-epoch pretraining on ImageNet-1K (3 days on 16 A100s), then fine-tuned it on K400 for 50 epochs (2 days on 16 A100s). After that, you pretrained a separate model on K400 and SSv2 for 800 epochs (3 days on 32 A100s) and fine-tuned it on the long-term video datasets. Finally, following UMT, you pretrained another model (on K710 and) on a 5M dataset for 10 epochs (1 day on 32 A100s). Can I summarize your training process like this?
That is not totally correct. There are two pretraining paths: supervised and self-supervised.
For supervised: "you first conducted a 300-epoch pretraining on ImageNet-1K (3 days on 16 A100s), then fine-tuned it on K400 for 50 epochs (2 days on 16 A100s)".
For self-supervised: "you pretrained a model on K400 and SSv2 for 800 epochs (3 days on 32 A100s)", then fine-tuned it on K400 for 50 epochs (2 days on 16 A100s).
Either of the above models can be fine-tuned on long-term video datasets.
As for the multi-modality datasets, we load the self-supervised pretrained weights and fine-tune on the 5M dataset for 10 epochs (about 1 day on 32 A100s).
We do not use K710 pretraining in VideoMamba.
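The two paths above can be written down as data for a quick sanity check of the totals. A minimal sketch; the stage names are mine (hypothetical), and the (epochs, days, gpus) figures are the approximate ones quoted in this thread:

```python
# Hypothetical summary of the VideoMamba-M training paths described above.
# Stage names are illustrative only; (epochs, days, gpus) are approximate
# figures quoted in this thread.
PIPELINES = {
    "supervised": [
        ("imagenet1k_pretrain", 300, 3, 16),
        ("k400_finetune",        50, 2, 16),
    ],
    "self_supervised": [
        ("k400_ssv2_distill", 800, 3, 32),
        ("k400_finetune",      50, 2, 16),
    ],
    # Multi-modality fine-tuning loads the self-supervised weights:
    "multi_modality": [
        ("5m_dataset_finetune", 10, 1, 32),
    ],
}

def total_gpu_days(path):
    """Sum approximate A100-days across all stages of one path."""
    return sum(days * gpus for _, _, days, gpus in PIPELINES[path])
```

Under these figures, the supervised path totals roughly 80 A100-days and the self-supervised path roughly 128 A100-days, before any long-term-video or multi-modality fine-tuning.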
All the fine-tuning scripts can be found in our repo. Please check them out~
Got it! Thank you for your patient reply!
Thank you for your great work! I want to know the exact training time for the models for image understanding, short-term video understanding (both supervised and self-supervised), long-term video understanding, and multi-modality video understanding, respectively. Thank you very much!