Closed · dfan closed this 1 year ago

Is the finetune training time reported in Table 6 for the same number of epochs, or the total wall-clock time to convergence? I can replicate the 2x inference speedup when r=65, but I don't observe a noticeable reduction in per-iteration training speed.
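For reference, a minimal sketch of how ToMe is applied and the inference speedup measured, assuming the `tome.patch.timm` entry point from the ToMe README; the model name, batch size, and timing loop below are illustrative assumptions (the results discussed here use the video MAE-ST model, where r=65 suits its much longer token sequence):

```python
import time

import timm
import torch
import tome

# Patch a timm ViT with Token Merging, then choose how many tokens
# to merge per block via `r` (per the ToMe README).
model = timm.create_model("vit_base_patch16_224", pretrained=True).eval()
tome.patch.timm(model)
model.r = 16  # r=65 as in this thread suits video models with far more tokens

# Illustrative throughput check on random inputs.
x = torch.randn(16, 3, 224, 224)
with torch.no_grad():
    model(x)  # warmup
    start = time.time()
    for _ in range(10):
        model(x)
print(f"{10 * x.shape[0] / (time.time() - start):.1f} images/s")
```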
That is the total wall-clock time for the same number of epochs. Note that your speed-up will depend heavily on the rest of your setup, especially during training.
Training is only as fast as your weakest link, so if you get no speed-up from ToMe, it means your throughput isn't dictated by the model but by some other bottleneck (dataloading, inter-node gradient sync time, etc.).
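One way to check which link dominates is to split each iteration's wall-clock time into loader wait versus compute. A rough sketch, assuming a standard PyTorch classification loop (the helper below is hypothetical, not code from this repo):

```python
import time

import torch
import torch.nn.functional as F

def profile_split(model, loader, optimizer, device="cuda"):
    """Rough wall-clock split: time spent waiting on the dataloader
    vs. time spent in forward/backward/step."""
    data_time = compute_time = 0.0
    mark = time.time()
    for x, y in loader:
        fetched = time.time()
        data_time += fetched - mark            # waiting on the loader
        x, y = x.to(device), y.to(device)
        loss = F.cross_entropy(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if device == "cuda":
            torch.cuda.synchronize()           # flush async GPU work before timing
        compute_time += time.time() - fetched  # forward/backward/step
        mark = time.time()
    print(f"data wait: {data_time:.1f}s  compute: {compute_time:.1f}s")
```

If the data-wait share dominates, a faster model (with ToMe or otherwise) won't move total training time much.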
For video, we use repeated augmentation (i.e., each data sample is augmented 4 times instead of pulling 4 different video clips; this is the default for MAE-ST), which speeds up dataloading. The benchmark was also performed on a single node (8 GPUs) to minimize gradient sync time.
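A minimal sketch of the repeated-augmentation idea, decoding each clip once and augmenting it several times; the wrapper and `augment` transform below are hypothetical stand-ins, not MAE-ST's actual implementation:

```python
import torch
from torch.utils.data import Dataset

class RepeatedAugDataset(Dataset):
    """Decode each video clip once and return `num_repeats` augmented
    views of it, since decoding is the expensive part of video loading."""

    def __init__(self, base_dataset, augment, num_repeats=4):
        self.base = base_dataset    # yields (raw_clip, label)
        self.augment = augment      # random spatial/temporal transform
        self.num_repeats = num_repeats

    def __len__(self):
        return len(self.base)

    def __getitem__(self, idx):
        clip, label = self.base[idx]  # one (slow) video decode
        views = torch.stack([self.augment(clip) for _ in range(self.num_repeats)])
        labels = torch.full((self.num_repeats,), label, dtype=torch.long)
        return views, labels          # flatten the repeat dim after collation
```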
Thank you for those details! That's very useful to know.
I'll close this issue for now. Feel free to reopen if you need more clarification.