TreeinRandomForest opened this issue 3 years ago
PyTorch DDP overlaps the computation of gradients during the backward pass with the communication (all-reduce) of gradients that have already been computed, i.e. those of layers closer to the output. See: https://pytorch.org/docs/stable/notes/ddp.html. The trade-off between computation and communication is controlled by the `bucket_cap_mb` argument (see: https://pytorch.org/docs/master/generated/torch.nn.parallel.DistributedDataParallel.html).
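A minimal sketch of where the knob lives, assuming a `torchrun` launch with the NCCL backend and a placeholder `torch.nn.Linear` standing in for the real model:

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Assumes launch via torchrun, which sets the env:// rendezvous variables.
dist.init_process_group(backend="nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(4096, 4096).cuda()  # placeholder for the real model

# bucket_cap_mb sets the gradient-bucket size (in MiB) at which DDP kicks
# off an asynchronous all-reduce during the backward pass; the default is 25.
ddp_model = DDP(model, device_ids=[local_rank], bucket_cap_mb=50)
```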
Do a line search across `bucket_cap_mb` during the first few iterations of training to minimize wall-clock time per iteration; a sketch follows.
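Continuing from the setup above, one way to run that line search is to time a few iterations for each candidate value on synthetic data and keep the fastest. The candidate grid, batch shape, and warm-up count here are illustrative assumptions, not tuned values:

```python
import copy
import time

def time_iterations(bucket_cap_mb, n_iters=20):
    """Time a few training iterations for one candidate bucket size."""
    # Fresh copy so gradient hooks from a previous DDP wrap don't interfere.
    ddp_model = DDP(copy.deepcopy(model), device_ids=[local_rank],
                    bucket_cap_mb=bucket_cap_mb)
    opt = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    x = torch.randn(64, 4096, device="cuda")  # synthetic batch
    for _ in range(3):  # warm-up: early iterations rebuild buckets/allocators
        opt.zero_grad()
        ddp_model(x).sum().backward()
        opt.step()
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_iters):
        opt.zero_grad()
        ddp_model(x).sum().backward()
        opt.step()
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / n_iters

# All ranks must sweep the same candidates in the same order.
results = {mb: time_iterations(mb) for mb in (5, 25, 50, 100, 200)}
best = min(results, key=results.get)
if dist.get_rank() == 0:
    print(f"best bucket_cap_mb: {best} ({results[best] * 1e3:.1f} ms/iter)")
```

The optimum depends on model size, interconnect bandwidth, and the GPU/compute ratio, so the sweep is worth rerunning when any of those change.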