Thanks for open-sourcing this great work. I'm wondering how many GPU resources are needed to reproduce the results.
In the abstract you report 32 A100s (80 GB) for 6 days when training from scratch. In Section 4.1 you mention 24 and 40 hours for stage 2 using the base and large models on the 25M corpus.
Is my understanding correct that the first stage (video only) takes about "6 days minus 40 hours" while the second stage takes 40 hours for ViT-L/16?
Also, is the codebase laid out so that "./single_modality" corresponds to the first stage and "./multi_modality" corresponds to the second stage?
Sorry for the late response.
Yes, Stage 1 takes most of the time: it requires about "6 days minus 40 hours".
And "./single_modality" corresponds to Stage 1.
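For a rough sense of scale, the figures quoted in this thread can be turned into hours and GPU-hours. This is only a back-of-the-envelope sketch: it assumes the 6 days is the end-to-end total for ViT-L/16 and that both stages run on the same 32 GPUs, neither of which is confirmed here. The function name `stage_budget` and its parameters are hypothetical.

```python
HOURS_PER_DAY = 24


def stage_budget(total_days, stage2_hours, num_gpus):
    """Split a total training budget into per-stage hours and GPU-hours.

    Assumption: both stages use the same number of GPUs, and
    total_days covers Stage 1 plus Stage 2.
    """
    total_hours = total_days * HOURS_PER_DAY
    stage1_hours = total_hours - stage2_hours  # "6 days minus 40 hours"
    return {
        "stage1_hours": stage1_hours,
        "stage2_hours": stage2_hours,
        "stage1_gpu_hours": stage1_hours * num_gpus,
        "stage2_gpu_hours": stage2_hours * num_gpus,
    }


# Numbers taken from the question: 32 A100s, 6 days total, 40 hours for Stage 2.
budget = stage_budget(total_days=6, stage2_hours=40, num_gpus=32)
print(budget)
```

Under those assumptions, Stage 1 comes to 104 hours of wall-clock time, which is about 4.3 days on 32 GPUs.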