Optimal PyTorch bucketing value

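PyTorch DDP (DistributedDataParallel) overlaps the computation of gradients in the backward pass with the communication of gradients that are already computed (those of layers closer to the output, which the backward pass reaches first) to the other processes. See: https://pytorch.org/docs/stable/notes/ddp.html. Gradients are grouped into buckets, and a bucket is all-reduced as soon as all of its gradients are ready. The bucket size, and with it the balance between computation and communication, is set by the argument bucket_cap_mb, in megabytes, default 25 (see: https://pytorch.org/docs/master/generated/torch.nn.parallel.DistributedDataParallel.html).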

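Do a line search across bucket_cap_mb in the first few iterations and keep the value that minimizes wall-clock time per iteration. Since bucket_cap_mb is a constructor argument, each candidate value means re-wrapping the model in DDP.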
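
A minimal sketch of what that search could look like. It swaps the line search for a plain sweep over a fixed candidate list, which is the simplest thing that could work. The names time_candidate, search_bucket_cap_mb, make_ddp_step_factory, optimizer_fn, loss_fn and batch_fn are hypothetical helpers, not part of PyTorch; the candidate values are arbitrary. The only PyTorch API assumed is the bucket_cap_mb constructor argument of DistributedDataParallel, plus an already initialized process group.

```python
import time
from typing import Callable, Dict, Iterable, Tuple

# A "step factory" takes a bucket_cap_mb value and returns a zero-argument
# callable that runs one full training step (forward, backward, optimizer).
StepFactory = Callable[[int], Callable[[], None]]


def time_candidate(make_step: StepFactory, bucket_cap_mb: int,
                   warmup: int = 2, iters: int = 5) -> float:
    """Mean wall-clock seconds per step for one bucket_cap_mb value."""
    step = make_step(bucket_cap_mb)
    for _ in range(warmup):          # untimed, lets caches and allocators settle
        step()
    start = time.perf_counter()
    for _ in range(iters):
        step()
    return (time.perf_counter() - start) / iters


def search_bucket_cap_mb(make_step: StepFactory,
                         candidates: Iterable[int] = (5, 10, 25, 50, 100),
                         warmup: int = 2,
                         iters: int = 5) -> Tuple[int, Dict[int, float]]:
    """Time every candidate and return (fastest value, all timings)."""
    timings = {mb: time_candidate(make_step, mb, warmup, iters)
               for mb in candidates}
    best = min(timings, key=timings.get)
    return best, timings


def make_ddp_step_factory(model, optimizer_fn, loss_fn, batch_fn) -> StepFactory:
    """Hypothetical adapter for a real DDP run.

    Assumes torch.distributed is already initialized and that `model` is on
    the right device. optimizer_fn(params) builds an optimizer and batch_fn()
    returns an (inputs, targets) pair; both are placeholders for project code.
    """
    def make_step(bucket_cap_mb: int) -> Callable[[], None]:
        import torch
        from torch.nn.parallel import DistributedDataParallel as DDP

        # bucket_cap_mb is fixed at construction, so re-wrap per candidate.
        ddp_model = DDP(model, bucket_cap_mb=bucket_cap_mb)
        optimizer = optimizer_fn(ddp_model.parameters())

        def step() -> None:
            inputs, targets = batch_fn()
            optimizer.zero_grad()
            loss_fn(ddp_model(inputs), targets).backward()
            optimizer.step()
            if torch.cuda.is_available():
                torch.cuda.synchronize()  # include queued GPU work in the timing

        return step

    return make_step
```

Usage would be roughly best, timings = search_bucket_cap_mb(make_ddp_step_factory(model, optimizer_fn, loss_fn, batch_fn)). Two caveats on this sketch: the timed steps update the model, so they either count as real training iterations or the weights need to be restored afterwards; and timings differ between ranks, so all ranks should agree on one value (for example the one chosen by rank 0) before the final DDP wrapper is built.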

aicoe-kaggle / diabetic-retinopathy