facebookresearch / SlowFast

PySlowFast: video understanding codebase from FAIR for reproducing state-of-the-art video models.

Multigrid training on Sthv2 #429

Open · tgy97 opened this issue 3 years ago

tgy97 commented 3 years ago

Hi, I tried to train a SlowFast model on Sthv2 with multigrid, following the default settings in the repo. I used 8 GPUs and set TRAIN.BATCH_SIZE to 64 (8 clips per GPU).

Multigrid works well and gives the expected speedup while the batch size is enlarged by the multigrid long-cycle schedule, i.e. in the 8x, 4x, and 2x stages. However, when the batch size is at 1x (for example, in the final fine-tuning epochs), the time cost becomes incredibly large (many times slower than the baseline with the same TRAIN.BATCH_SIZE), which makes the whole training run even longer than the baseline.

I tried to find out what happens in the 1x long-cycle epochs, and there is something strange about the short-cycle setting. In the paper, the short cycle contains three shapes, 4x, 2x, and 1x, and that is what I see in the 1x and 2x long-cycle epochs. But in the 8x and 4x long-cycle epochs, the short-cycle shapes become 2x, 1x, and 1x, because the short-cycle shapes are calculated from DATA.TRAIN_CROP_SIZE (see below), and DATA.TRAIN_CROP_SIZE is itself changed by the long-cycle schedule. https://github.com/facebookresearch/SlowFast/blob/923e9470b30ac7749af563362eed77339375c82e/slowfast/datasets/multigrid_helper.py#L49-L60

I wonder why the short-cycle shapes are different when DATA.TRAIN_CROP_SIZE is changed by the long-cycle schedule. Is this a bug, or is it deliberately set up like this? And does it have anything to do with the huge time overhead in the 1x long-cycle stage?

FYI, here are the log files for the multigrid and baseline runs: https://drive.google.com/drive/folders/1Jz-I0bDcRRca4Ij1oHBfJDmfUGtF_blr?usp=sharing

haooooooqi commented 3 years ago

Hey @tgy97,

Thanks for playing with PySF! Maybe we can get some help from the expert! @chaoyuaw

:V Haoqi