This is the companion PR to a beacon-internal project.
It defines a `DistributedClassifier` with an implementation of `train!` that sends batch specs (whatever information workers need to construct their next batch) to the workers, which compute losses and gradients and send them back. The driver node sums the gradients from all workers and performs the parameter updates. This is a purely synchronous distributed training loop: the workers are always running with the latest version of the model, which ensures that the model converges and performs the same as it would if it were trained locally. (There are no such guarantees for asynchronous training schemes, where workers are often running with stale model parameters.)
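To make the protocol concrete, here is a minimal single-process sketch of that synchronous loop. None of these names come from the actual PR — `BatchSpec`, `worker_step`, and this `train!` are all illustrative stand-ins, with a toy quadratic loss in place of a real model, and plain function calls in place of remote messaging:

```julia
# Hypothetical sketch of the synchronous driver/worker loop; all names are
# illustrative, not the PR's API.

# A "batch spec" tells a worker which samples make up its next batch.
struct BatchSpec
    indices::Vector{Int}
end

# Stand-in for a worker: given the current params and a batch spec, return
# (loss, gradient). In the real setup this runs on a remote node and the
# results are sent back to the driver.
function worker_step(params::Vector{Float64}, spec::BatchSpec)
    grad = zeros(length(params))
    loss = 0.0
    for i in spec.indices
        diff = params .- i          # toy quadratic loss: sum of (p - i)^2
        loss += sum(abs2, diff)
        grad .+= 2 .* diff
    end
    return loss, grad
end

# Driver loop: every step, all workers see the *same, latest* params, so the
# summed-gradient update is identical to single-node training on the union of
# the workers' batches.
function train!(params::Vector{Float64}, specs::Vector{BatchSpec}; lr=0.01, steps=200)
    for _ in 1:steps
        grads = [worker_step(params, s)[2] for s in specs]  # "send specs to workers"
        total = reduce(+, grads)                            # driver sums gradients
        params .-= lr .* total                              # driver updates params
    end
    return params
end
```

Because every worker computes its gradient against the same parameter snapshot, the driver's update is exactly the full-batch update a local trainer would make, which is the convergence equivalence claimed above.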
It also defines a `DistributedLogger` that lets workers send logs back to the driver node. This hasn't been tested beyond it not barfing.
It also defines some utilities in `distributed/` that are not specific to `Lighthouse` or `Flux`, and which should eventually be moved somewhere else.