aicoe-kaggle / diabetic-retinopathy

Optimal PyTorch bucketing value #12

Open TreeinRandomForest opened 3 years ago

TreeinRandomForest commented 3 years ago

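PyTorch DDP overlaps gradient computation during the backward pass with the communication (all-reduce) of gradients that have already been computed, i.e. those of layers closer to the output ("more forward" in the network), to the other processes. See: https://pytorch.org/docs/stable/notes/ddp.html. The balance between the two is controlled by the bucket_cap_mb argument, which sets the size in MB of the gradient buckets that are communicated together (see: https://pytorch.org/docs/master/generated/torch.nn.parallel.DistributedDataParallel.html).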

Do a line search over bucket_cap_mb during the first few iterations and keep the value that minimizes wall-clock time per iteration.
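
A minimal sketch of what such a line search could look like. It is not code from this repo: the helper name line_search_bucket_cap, the candidate grid, and the warmup/timed iteration counts are all assumptions chosen for illustration. The caller supplies measure_step_seconds, a hypothetical function that runs one training iteration with the model wrapped at the given bucket_cap_mb and returns the elapsed seconds. Since bucket_cap_mb is passed to the DistributedDataParallel constructor, trying a new value means re-wrapping the model.

```python
import statistics
from typing import Callable, Dict, Iterable, Tuple


def line_search_bucket_cap(
    measure_step_seconds: Callable[[int], float],
    candidates_mb: Iterable[int] = (1, 5, 10, 25, 50, 100),  # assumed grid
    warmup_iters: int = 2,  # assumed; untimed iterations per candidate
    timed_iters: int = 5,   # assumed; timed iterations per candidate
) -> Tuple[int, Dict[int, float]]:
    """Return the candidate with the lowest median step time, plus all medians.

    measure_step_seconds(cap) is a caller-supplied (hypothetical) hook that
    runs one training step with bucket_cap_mb=cap and returns seconds elapsed.
    """
    median_seconds: Dict[int, float] = {}
    for cap in candidates_mb:
        # Discard the first few steps so one-time setup costs do not skew timings.
        for _ in range(warmup_iters):
            measure_step_seconds(cap)
        samples = [measure_step_seconds(cap) for _ in range(timed_iters)]
        # The median is less sensitive to occasional slow steps than the mean.
        median_seconds[cap] = statistics.median(samples)
    best_cap = min(median_seconds, key=median_seconds.get)
    return best_cap, median_seconds


if __name__ == "__main__":
    # Synthetic stand-in for a real timing hook, only to show the call shape.
    def fake_step(cap: int) -> float:
        return 1.0 + abs(cap - 25) / 100.0

    best, timings = line_search_bucket_cap(fake_step)
    print(best, timings)
```

In a real run, measure_step_seconds would wrap the model with DistributedDataParallel(model, bucket_cap_mb=cap), run forward, backward and the optimizer step, synchronize the device, and time the step with time.perf_counter(). All ranks need to end up with the same value, so one option is to let rank 0 decide and broadcast its choice.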