This is the companion PR to a beacon-internal project.
It defines a `DistributedClassifier` with an implementation of `train!` that sends batch specs (whatever information workers need to construct their next batch) to the workers, which compute losses and gradients and send them back. The driver node sums the gradients from all workers and performs the parameter updates. This is a purely synchronous distributed training loop: the workers are always running with the latest version of the model, which ensures that the model converges and performs the same as it would if it were trained locally. (There are no such guarantees for asynchronous training schemes, where workers are often running with stale model parameters.)
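To make the protocol concrete, here is a minimal single-process sketch of that synchronous loop. None of these names come from the actual PR — `BatchSpec`, `worker_step`, and this `train!` are all illustrative stand-ins, with a toy quadratic loss in place of a real model, and plain function calls in place of remote messaging:

```julia
# Hypothetical sketch of the synchronous driver/worker loop; all names are
# illustrative, not the PR's API.

# A "batch spec" tells a worker which samples make up its next batch.
struct BatchSpec
    indices::Vector{Int}
end

# Stand-in for a worker: given the current params and a batch spec, return
# (loss, gradient). In the real setup this runs on a remote node and the
# results are sent back to the driver.
function worker_step(params::Vector{Float64}, spec::BatchSpec)
    grad = zeros(length(params))
    loss = 0.0
    for i in spec.indices
        diff = params .- i          # toy quadratic loss: sum of (p - i)^2
        loss += sum(abs2, diff)
        grad .+= 2 .* diff
    end
    return loss, grad
end

# Driver loop: every step, all workers see the *same, latest* params, so the
# summed-gradient update is identical to single-node training on the union of
# the workers' batches.
function train!(params::Vector{Float64}, specs::Vector{BatchSpec}; lr=0.01, steps=200)
    for _ in 1:steps
        grads = [worker_step(params, s)[2] for s in specs]  # "send specs to workers"
        total = reduce(+, grads)                            # driver sums gradients
        params .-= lr .* total                              # driver updates params
    end
    return params
end
```

Because every worker computes its gradient against the same parameter snapshot, the driver's update is exactly the full-batch update a local trainer would make, which is the convergence equivalence claimed above.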
It also defines a `DistributedLogger` that lets workers send logs back to the driver node. This hasn't been tested beyond it not barfing.
It also defines some utilities in `distributed/` that are not specific to `Lighthouse` or `Flux`, and which should eventually be moved somewhere else.