facebookresearch / LaViLa

Code release for "Learning Video Representations from Large Language Models"

Training Time #3

Closed · mmaaz60 closed 1 year ago

mmaaz60 commented 1 year ago

Hi Team,

Thank you for sharing this great work. Could you provide the training time for each of the tasks in the paper on a particular dataset? For example, how long does pre-training take on 32 A100 40GB GPUs? How about the other downstream tasks? Thank you, and I look forward to your response.

zhaoyue-zephyrus commented 1 year ago

Hi @mmaaz60 ,

Thank you for your interest in our work.

For the downstream tasks, we've attached the training logs in the MODEL_ZOO, from which you can read off the training time.

For the pre-training task, it takes ~5 hr/epoch to train a dual-encoder baseline (i.e. ground-truth narrations only), and we train it for 5 epochs. For the LaViLa-style dual-encoder, the training time is roughly 2x that, since we alternately sample text from the Narrator and the Rephraser.
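As a rough back-of-the-envelope check, the figures above work out as follows (this sketch only restates the numbers quoted in this comment; nothing here is measured from the released logs):

```python
# Back-of-the-envelope pre-training cost on 32x V100 32GB GPUs,
# using the per-epoch figures quoted in this comment (assumed, not from logs).
HOURS_PER_EPOCH_BASELINE = 5   # dual-encoder baseline (ground-truth narrations only)
NUM_EPOCHS = 5
LAVILA_FACTOR = 2              # alternating Narrator/Rephraser text roughly doubles the cost

baseline_hours = HOURS_PER_EPOCH_BASELINE * NUM_EPOCHS   # ~25 hr
lavila_hours = baseline_hours * LAVILA_FACTOR            # ~50 hr
print(f"baseline: ~{baseline_hours} hr, LaViLa-style: ~{lavila_hours} hr")
```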

Note that the hardware we use for pre-training/fine-tuning is 32x V100 32GB GPUs (unless otherwise specified). I haven't experimented with A100s, but based on some external comparisons the speedup is approximately 2~3x (it varies with model, precision, and data I/O), so you can reduce your ETA accordingly.
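If you do run on A100s, a similarly hedged estimate (assuming the 2~3x speedup mentioned above and the ~50 hr V100 figure from the earlier sketch) would look like:

```python
# Rough ETA on 32x A100, assuming a 2-3x speedup over V100
# (an external comparison, not measured on this codebase).
v100_hours = 50.0  # ~LaViLa-style pre-training estimate from the sketch above
for speedup in (2.0, 3.0):
    print(f"assuming {speedup:.0f}x speedup: ~{v100_hours / speedup:.0f} hr")
```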

Hope this answers your question.

Best, Yue

zhaoyue-zephyrus commented 1 year ago

Closing this issue due to inactivity... Feel free to re-open the issue if needed.