facebookresearch / SlowFast

PySlowFast: video understanding codebase from FAIR for reproducing state-of-the-art video models.
Apache License 2.0

Multigrid Training Time #292

Open mrorro opened 4 years ago

mrorro commented 4 years ago

Hi, I'm trying to reproduce the multigrid training on Kinetics-400 in roughly 5 hours. I'm using a cluster with 4 V100s per node and the SLOWFAST_8x8_R50_stepwise_multigrid.yaml config file with NUM_GPUS 4 and NUM_SHARDS 32, and I have tried some tuning of the batch size and number of workers. While the accuracy seems fine, the training time is far from 5 hours even when using 256 GPUs. Any advice would be really welcome.

Thank you for sharing your code.

junwenxiong commented 4 years ago

Hi @mrorro, I'm also trying to use the multigrid training schedule, but I ran into some trouble with the dataset: Kinetics-400 is too large to download. Could you please tell me how to download it quickly, or share the dataset with me? Thank you!

mrorro commented 3 years ago

> Hi @mrorro, I'm also trying to use the multigrid training schedule, but I ran into some trouble with the dataset: Kinetics-400 is too large to download. Could you please tell me how to download it quickly, or share the dataset with me? Thank you!

I don't know of a fast way. I used https://github.com/Showmax/kinetics-downloader and waited patiently.

chaoyuaw commented 3 years ago

Hi @mrorro , Thanks for trying our code and thanks for your questions! Sorry for the late reply.

In my experience, speed issues often have something to do with data-loader speed. Multigrid uses larger batch sizes, so it requires loading and processing more videos at each iteration. It may be worth double-checking GPU utilization to see whether it is indeed low, which would confirm that the issue really is in data loading.
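
If it helps, here is a minimal sketch (not part of PySlowFast) for polling GPU utilization with nvidia-smi while a training job runs in another process; consistently low numbers suggest the GPUs are waiting on the input pipeline:

```python
import subprocess
import time

def poll_gpu_utilization(interval_s=5.0, samples=12):
    """Print per-GPU utilization every few seconds while training runs elsewhere."""
    for _ in range(samples):
        out = subprocess.run(
            ["nvidia-smi",
             "--query-gpu=index,utilization.gpu,memory.used",
             "--format=csv,noheader"],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
        print(out)  # e.g. "0, 23 %, 9678 MiB" -> GPU likely starved for data
        time.sleep(interval_s)

if __name__ == "__main__":
    poll_gpu_utilization()
```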

For distributed training across multiple machines, one potential source of issues is the network speed between machines. I don't have a better way to debug this than profiling each step of the data pipeline to figure out where it is bottlenecked (e.g., network/transmission speed between machines, hard-disk speed, data-augmentation speed, etc.) and then optimizing that bottleneck.
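
As a rough, hypothetical example of that kind of per-stage profiling (assuming PyAV, which PySlowFast already uses for decoding), one can time the raw file read separately from the decode for a single clip; the augmentation step could be timed the same way:

```python
import time

import av  # PyAV, which PySlowFast already uses for video decoding

def profile_clip(path):
    """Split the cost of loading one clip into storage read vs. CPU decode."""
    t0 = time.time()
    with open(path, "rb") as f:
        raw = f.read()                        # pure disk / network-FS read
    read_s = time.time() - t0

    t0 = time.time()
    container = av.open(path)
    n_frames = sum(1 for _ in container.decode(video=0))  # CPU-side decode
    container.close()
    decode_s = time.time() - t0

    print(f"{path}: {len(raw) / 1e6:.1f} MB, "
          f"read {read_s:.3f}s, decode {decode_s:.3f}s ({n_frames} frames)")
```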

donnyyou commented 3 years ago

> Hi, I'm trying to reproduce the multigrid training on Kinetics-400 in roughly 5 hours. I'm using a cluster with 4 V100s per node and the SLOWFAST_8x8_R50_stepwise_multigrid.yaml config file with NUM_GPUS 4 and NUM_SHARDS 32, and I have tried some tuning of the batch size and number of workers. While the accuracy seems fine, the training time is far from 5 hours even when using 256 GPUs. Any advice would be really welcome.
>
> Thank you for sharing your code.

Could you share how you set the lr and batch size with 4 * 32 GPUs?

chaoyuaw commented 3 years ago

Hi @donnyyou , I follow the linear scaling rule (https://arxiv.org/abs/1706.02677) for setting the LR. For example, our default uses 8 GPUs for training, so when using 16x more GPUs, we set the LR to be 16x larger. The number of examples per GPU stays the same when using a different number of GPUs.
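
For concreteness, a small sketch of that rule (the function and variable names here are illustrative, not PySlowFast API):

```python
# Linear scaling rule (https://arxiv.org/abs/1706.02677): scale the base LR by
# the ratio of total GPUs (NUM_GPUS * NUM_SHARDS) to the 8-GPU reference setup,
# keeping the number of examples per GPU fixed.

def scaled_lr(base_lr, num_gpus, num_shards, reference_gpus=8):
    total_gpus = num_gpus * num_shards
    return base_lr * total_gpus / reference_gpus

# Example: 4 GPUs/node * 32 nodes = 128 GPUs = 16x the 8-GPU reference,
# so an example base LR of 0.1 would become 1.6.
print(scaled_lr(0.1, num_gpus=4, num_shards=32))  # 1.6
```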

donnyyou commented 3 years ago

@chaoyuaw Thanks for your reply! I wonder how to set the epoch_factor (1.5x @ 8 GPUs) when using 16x GPUs?

chaoyuaw commented 3 years ago

@donnyyou , yes, we use EPOCH_FACTOR=1.5 as our default setting.

mrorro commented 3 years ago

Hi @chaoyuaw . Thanks for answering.

Yes, it seems to have something to do with the data loader. These are the logs for 2 nodes, 4 GPUs each, and 4 WORKERS per node.

```
[10/12 09:43:22][INFO] train_net.py: 423: Start epoch: 1
[10/12 09:43:22][INFO] train_net.py: 52: Entering in the iter loop
[10/12 09:43:55][INFO] train_net.py: 54: Startig iteration: 1
[10/12 09:44:00][INFO] train_net.py: 54: Startig iteration: 2
...
[10/12 10:16:31][INFO] train_net.py: 54: Startig iteration: 687
[10/12 10:16:39][INFO] logging.py: 93: json_stats: {"RAM": "76.12/314.59G", "_type": "train_epoch", "dt": 1.43635, "dt_data": 1.43635, "dt_net": 5.67138, "epoch": "1/2", "eta": "0:16:26", "gpu_mem": "9.45G", "loss": 5.87722, "lr": 0.00971, "top1_err": 98.99188, "top5_err": 95.80035}
[10/12 10:16:39][INFO] train_net.py: 309: Start update_bn_stats
[10/12 10:24:20][INFO] train_net.py: 311: End update_bn_stats
...
```

The issue seems to be with the enumerate over the data loader in the iteration loop, both for training and evaluation. A similar issue and a possible solution are discussed here: https://discuss.pytorch.org/t/enumerate-dataloader-slow/87778
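
One of the workarounds discussed in that thread is to keep DataLoader worker processes alive across epochs (available in PyTorch >= 1.7), so the worker start-up cost is paid only once. A self-contained sketch with dummy data, not a change to PySlowFast itself:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy dataset just to make the snippet runnable; in practice this would be
# the Kinetics dataset built by the codebase.
dataset = TensorDataset(torch.randn(256, 3, 8, 56, 56),
                        torch.randint(0, 400, (256,)))

loader = DataLoader(
    dataset,
    batch_size=8,
    shuffle=True,
    num_workers=4,
    pin_memory=True,
    persistent_workers=True,  # workers survive between epochs (PyTorch >= 1.7)
    prefetch_factor=4,        # each worker keeps a few batches ready
)

for epoch in range(2):
    for clips, labels in loader:  # no worker re-spawn cost after the first epoch
        pass
```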

Another issue seems to be in update_bn_stats from the fvcore library. Since it runs on the CPU, increasing the number of WORKERS could help here as well, but not enough. Here is the log with 16 WORKERS; now the first iteration also seems slow:

```
[10/12 10:53:17][INFO] train_net.py: 423: Start epoch: 1
[10/12 10:53:17][INFO] train_net.py: 52: Entering in the iter loop
[10/12 10:54:54][INFO] train_net.py: 54: Startig iteration: 1
[10/12 10:56:16][INFO] train_net.py: 54: Startig iteration: 2
[10/12 10:56:16][INFO] train_net.py: 54: Startig iteration: 3
[10/12 10:56:17][INFO] train_net.py: 54: Startig iteration: 4
...
[10/12 11:11:07][INFO] logging.py: 93: json_stats: {"RAM": "86.60/314.59G", "_type": "train_epoch", "dt": 2.94581, "dt_data": 2.94581, "dt_net": 0.43552, "epoch": "1/2", "eta": "0:33:43", "gpu_mem": "9.45G", "loss": 5.86814, "lr": 0.00971, "top1_err": 98.90403, "top5_err": 95.56837}
[10/12 11:11:07][INFO] train_net.py: 309: Start update_bn_stats
[10/12 11:16:03][INFO] train_net.py: 311: End update_bn_stats
```
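
For context on why update_bn_stats hits the same bottleneck: a precise-BN pass is essentially extra forward passes over the training loader, roughly the shape of the simplified sketch below (not fvcore's actual implementation), so every batch it consumes goes through the same data pipeline and its cost scales with the number of batches it is given.

```python
import torch

@torch.no_grad()
def recompute_bn_stats(model, loader, num_iters=200):
    """Rough sketch of a precise-BN pass: forward passes in train mode so that
    BatchNorm layers re-estimate their running statistics. Each of the
    `num_iters` batches still has to come through the (possibly slow) loader.
    """
    was_training = model.training
    model.train()  # BN running stats are only updated in train mode
    for i, (inputs, _labels) in enumerate(loader):
        if i >= num_iters:
            break
        model(inputs)
    model.train(was_training)
```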

Thank you.

chaoyuaw commented 3 years ago

@mrorro , Thanks for providing more information.

I think the speed issues of update_bn_stats, training, and evaluation are likely due to the same cause: they're I/O-bound. If possible, it might help to use a faster local hard drive to speed up image/video loading.
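
A quick way to check that hypothesis is to measure raw sequential read throughput over a sample of the videos, once on the shared filesystem and once on a local-disk copy (a hypothetical helper, not part of the codebase):

```python
import os
import time

def read_throughput(video_dir, max_files=100):
    """Sequentially read up to `max_files` videos and report MB/s.

    Running this against the shared/network filesystem and against a local
    SSD copy of the same files shows how much the storage backend limits
    the data loader.
    """
    paths = [os.path.join(video_dir, f) for f in sorted(os.listdir(video_dir))][:max_files]
    total_bytes = 0
    t0 = time.time()
    for p in paths:
        with open(p, "rb") as f:
            total_bytes += len(f.read())
    elapsed = time.time() - t0
    print(f"{len(paths)} files, {total_bytes / 1e6:.0f} MB in {elapsed:.1f}s "
          f"-> {total_bytes / 1e6 / elapsed:.1f} MB/s")
```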