OpenGVLab / VideoMamba

VideoMamba: State Space Model for Efficient Video Understanding
https://arxiv.org/abs/2403.06977
Apache License 2.0

Cost of training #30

algorithmee closed this issue 2 months ago

algorithmee commented 2 months ago

Thank you for your great work! I would like to know the exact training times for the image understanding, short-term video understanding (both supervised and self-supervised), long-term video understanding, and multi-modality video understanding models, respectively. Thank you very much!

Andy1621 commented 2 months ago

Hi! It's hard to summarize all the different costs, but you can run the training code to measure the cost for each model. Training VideoMamba is about 1.5~2x slower than ViT for image understanding and short-term video understanding, though its inference speed is higher.
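If you want to measure this yourself, a minimal single-GPU timing sketch like the one below is enough; the model constructors in the commented usage are placeholders for whatever checkpoints you build from this repo, only the standard torch calls are assumed.

```python
import time
import torch

def throughput(model, batch, n_iters=50, warmup=10):
    """Rough samples-per-second measurement on a single GPU."""
    model = model.cuda().eval()
    batch = batch.cuda()
    with torch.no_grad():
        for _ in range(warmup):      # warm up kernels before timing
            model(batch)
        torch.cuda.synchronize()
        start = time.time()
        for _ in range(n_iters):
            model(batch)
        torch.cuda.synchronize()
    return n_iters * batch.shape[0] / (time.time() - start)

# Placeholder usage: build the ViT / VideoMamba models however you normally do.
# vit = build_vit(...); videomamba = build_videomamba(...)
# print(throughput(vit, torch.randn(64, 3, 224, 224)))
# print(throughput(videomamba, torch.randn(64, 3, 224, 224)))
```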

For VideoMamba-M, our largest model, the 300-epoch fine-tuning on ImageNet costs about 3 days on 16 A100s. The 50-epoch fine-tuning on K400 with 8 frames costs about 2 days on 16 A100s. The 200-epoch masked pretraining on K400 costs about 3 days on 32 A100s. As for the long-term video datasets, training only takes a few hours since the datasets are small.
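To put these figures into a single unit, here is a rough back-of-the-envelope calculation of the GPU-days implied by each stage (my own arithmetic from the numbers above, not an official log):

```python
# Approximate A100-days implied by the costs quoted above for VideoMamba-M.
stages = {
    "ImageNet fine-tuning (300 epochs)":       (3, 16),  # (days, A100s)
    "K400 fine-tuning, 8 frames (50 epochs)":  (2, 16),
    "K400 masked pretraining":                 (3, 32),
}

for name, (days, gpus) in stages.items():
    print(f"{name}: ~{days * gpus} A100-days")
# ImageNet fine-tuning (300 epochs): ~48 A100-days
# K400 fine-tuning, 8 frames (50 epochs): ~32 A100-days
# K400 masked pretraining: ~96 A100-days
```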

algorithmee commented 2 months ago

Thank you for your reply! How long does the 10-epoch distillation pretraining on WebVid-2M and CC3M take (and it seems you also trained on a 25M dataset like UMT; how about that)? By the way, is "distill VideoMamba-M over 800 epochs" on page 10 of your paper actually "200 epochs"?

Andy1621 commented 2 months ago

Thanks for raising this. For K400 and SSv2, we indeed distill for 800 epochs (200 epochs was used for K710 in UMT). For the 5M data, it's much faster and only requires about 1 day.

algorithmee commented 2 months ago

Thank you for your reply! Did you do the 200-epoch training on K710 before the 5M dataset, as in UMT? If so, how long did it take? To sum up: you first conducted a 300-epoch pretraining on ImageNet-1K for 3 days on 16 A100s. Then you fine-tuned it on K400 for 50 epochs (2 days on 16 A100s). After that, you pretrained a model on K400 and SSv2 for 800 epochs (3 days on 32 A100s) and fine-tuned it on the long-term video datasets. Finally, following UMT, you pretrained another model (on K710 and then) on a 5M dataset for 10 epochs (1 day on 32 A100s). Can I summarize your training process like this?

Andy1621 commented 2 months ago

That is not completely correct. There are two pretraining routes: supervised and self-supervised.

For supervised: "you first conducted a 300-epoch pretraining on ImageNet-1K for 3 days on 16 A100s. Then you fine-tuned it on K400 for 50 epochs (2 days on 16 A100s)."

For self-supervised: "you pretrained a model on K400 and SSv2 for 800 epochs (3 days on 32 A100s)", and then "fine-tuned it on K400 for 50 epochs (2 days on 16 A100s)".

Either of the above models can be fine-tuned on long-term video datasets.

As for the multi-modality datasets, we load the self-supervised pretrained weights and fine-tune the model "on a 5M dataset for 10 epochs (1 day on 32 A100s)".

We do not use K710 pretraining in VideoMamba.
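If it helps to have the two routes in one place, here is how this thread reads when written down as a plain summary in code form; the costs are the ones quoted above, and nothing here is an official config file:

```python
# Two pretraining routes for VideoMamba-M, as described in this thread.
PIPELINES = {
    "supervised": [
        "300-epoch supervised pretraining on ImageNet-1K (~3 days, 16 A100s)",
        "50-epoch fine-tuning on K400 with 8 frames (~2 days, 16 A100s)",
    ],
    "self-supervised": [
        "800-epoch masked distillation pretraining on K400/SSv2 (~3 days, 32 A100s)",
        "50-epoch fine-tuning on K400 with 8 frames (~2 days, 16 A100s)",
    ],
}

# Either route's model can then be fine-tuned on the long-term video datasets
# (only a few hours, since those datasets are small).
# For multi-modality, the self-supervised weights are fine-tuned on the ~5M
# (WebVid-2M + CC3M) data for 10 epochs (~1 day, 32 A100s); no K710 pretraining.
```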

Andy1621 commented 2 months ago

All the fine-tuning scripts can be found in our repo. Please check it~

algorithmee commented 2 months ago

Got it! Thank you for your patient reply!