Note: since we are still waiting for GPUs, use MNIST as a proxy for now.
The simplest parallelization technique is splitting each batch across multiple GPUs. This is fully synchronous: gradients are averaged across all processes after every step, so some processes sit idle whenever one takes much longer to finish. PyTorch calls this Distributed Data Parallel (DDP - https://pytorch.org/tutorials/intermediate/ddp_tutorial.html).
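A minimal sketch of what such a run could look like, assuming torchvision is available for the MNIST download. It uses the gloo backend so it also runs on CPU while we wait for GPUs (swap in nccl and move the model/data to the GPU once they arrive); the model, batch size, and learning rate are placeholders.

```python
# Minimal DDP sketch on MNIST (gloo backend so it also works on CPU for now).
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler
from torchvision import datasets, transforms


def train(rank, world_size):
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    dataset = datasets.MNIST(
        "data", train=True, download=True, transform=transforms.ToTensor()
    )
    # DistributedSampler gives each process a disjoint shard of every epoch.
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    loader = DataLoader(dataset, batch_size=64, sampler=sampler)

    # Placeholder model: a linear classifier is enough to exercise DDP.
    model = DDP(torch.nn.Sequential(
        torch.nn.Flatten(), torch.nn.Linear(28 * 28, 10)
    ))
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.CrossEntropyLoss()

    for epoch in range(1):
        sampler.set_epoch(epoch)  # reshuffle shards each epoch
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()  # DDP all-reduces gradients here
            opt.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = 2  # placeholder: set to the number of workers/GPUs
    mp.spawn(train, args=(world_size,), nprocs=world_size, join=True)
```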
Make the following plot (see the plotting sketch below):
- x-axis: train time
- y-axis: test score (accuracy for MNIST; the Kaggle evaluation metric, quadratic weighted kappa, for retinopathy: https://www.kaggle.com/c/diabetic-retinopathy-detection/overview/evaluation)
- one curve per num_workers (= 1, 2, ..., N, where N = total number of available GPUs)
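A plotting sketch, assuming each run logs (elapsed train time, test score) pairs per num_workers. The `results` values below are dummy numbers purely to show the expected format, not real measurements.

```python
# One curve per num_workers: test score vs. train time.
import matplotlib.pyplot as plt

# results[num_workers] -> list of (train_time_seconds, test_score) pairs,
# e.g. recorded after every epoch of the DDP runs above.
results = {
    1: [(30, 0.90), (60, 0.95), (90, 0.97)],  # dummy values, replace with logs
    2: [(18, 0.90), (36, 0.95), (54, 0.97)],  # dummy values, replace with logs
}

for num_workers, points in sorted(results.items()):
    times, scores = zip(*points)
    plt.plot(times, scores, marker="o", label=f"num_workers={num_workers}")

plt.xlabel("train time (s)")
plt.ylabel("test score (accuracy for MNIST)")
plt.legend()
plt.savefig("scaling.png")
```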
Interesting paper: "Don't Decay the Learning Rate, Increase the Batch Size" (https://arxiv.org/abs/1711.00489)
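As a rough illustration of the paper's idea (not their code): replace a step-wise learning-rate decay with a step-wise batch-size increase by the same factor. The milestones and factor below are made-up placeholders, and in practice growing the batch size means rebuilding the DataLoader.

```python
# Sketch: grow the batch size at the epochs where you would otherwise decay the LR.
def batch_size_schedule(epoch, base_batch_size=128, factor=5, milestones=(30, 60, 80)):
    """Return the batch size to use at `epoch` (all constants are illustrative)."""
    growth = factor ** sum(epoch >= m for m in milestones)
    return base_batch_size * growth
```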