OpenGVLab / unmasked_teacher

[ICCV2023 Oral] Unmasked Teacher: Towards Training-Efficient Video Foundation Models
https://arxiv.org/abs/2303.16058
MIT License

GPU resources to reproduce results #35

Closed wufeim closed 4 months ago

wufeim commented 4 months ago

Hi,

Thanks for open-sourcing this great work. I'm wondering how many GPU resources are needed to reproduce the results.

In the abstract you report 32 A100s (80 GB) for 6 days when training from scratch. In Section 4.1 you mention 24 and 40 hours for stage 2 using the base and large models on the 25M corpus.

Is my understanding correct that the first stage (video only) takes about "6 days minus 40 hours" while the second stage takes 40 hours for ViT-L/16?

Also, is the codebase laid out such that "./single_modality" corresponds to the first stage and "./multi_modality" corresponds to the second stage?

Thanks very much.

Andy1621 commented 4 months ago

Sorry for the late response.
Yes, Stage 1 takes most of the time: it requires "6 days minus 40 hours". And "./single_modality" corresponds to Stage 1.

wufeim commented 4 months ago

Thanks very much for the information!