With the Training API (#172) in place, we can add the ability to stop training and save intermediate training state so that training can be resumed later.
Suggested API:
// Start the training
// Training object would contain current epoch, batch, modelParameters
training = Job.train(...)
// Stop training
training.stop()
New event emitted by Job.train: 'stop'
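As an illustration only, the stop request and the 'stop' event could behave roughly like the sketch below. `TrainingSketch`, its `run` method, and the event payload are hypothetical stand-ins, not the object that Job.train actually returns, and the loop is synchronous for brevity.

```javascript
// Hypothetical stand-in for the object returned by Job.train (assumption, not the real API).
class TrainingSketch {
  constructor({ epochs, batchesPerEpoch }) {
    this.epochs = epochs;
    this.batchesPerEpoch = batchesPerEpoch;
    this.epoch = 0;
    this.batch = 0;
    this.stopped = false;
    this.handlers = {};
  }

  // Register a listener for a named event such as 'stop'.
  on(event, handler) {
    (this.handlers[event] = this.handlers[event] || []).push(handler);
    return this;
  }

  // Call every listener registered for the event.
  emit(event, payload) {
    (this.handlers[event] || []).forEach((handler) => handler(payload));
  }

  // Request a stop; the loop honours it before starting the next batch.
  stop() {
    this.stopped = true;
  }

  // Run batches until finished or stopped. onBatch stands in for a real training step.
  run(onBatch) {
    for (; this.epoch < this.epochs; this.epoch++) {
      for (; this.batch < this.batchesPerEpoch; this.batch++) {
        if (this.stopped) {
          // epoch/batch point at the next batch that has not run yet.
          this.emit('stop', { epoch: this.epoch, batch: this.batch });
          return;
        }
        onBatch(this.epoch, this.batch);
      }
      this.batch = 0;
    }
  }
}

const training = new TrainingSketch({ epochs: 2, batchesPerEpoch: 3 });
const seen = [];
let stopInfo = null;
training.on('stop', (info) => { stopInfo = info; });
training.run((epoch, batch) => {
  seen.push([epoch, batch]);
  if (epoch === 0 && batch === 1) training.stop(); // stop once the second batch is done
});
```

In this sketch the 'stop' payload records the position of the next batch to run, which is the information a checkpoint would need in order to resume.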
// User-defined serialization (serialize/unserialize/storage are up to the user)
serialized_checkpoint = serialize(training)
unserialized_checkpoint = unserialize(serialized_checkpoint)
// Supplying checkpoint back to Job.train
training = Job.train(trainingPlan, {
  ...
  checkpoint: unserialized_checkpoint
})
The training loop should read the checkpoint's properties and restore the model params, epoch, step, batchSize, etc. from them.
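A minimal sketch of how a loop might restore its state from a checkpoint is shown below. It assumes a plain-object checkpoint with `epoch`, `step`, `batchSize`, and `modelParams` fields and uses JSON as one possible user-defined serialization. The `train` function, the `maxSteps` option (used here to imitate an early stop), and the toy parameter update are all hypothetical and are not part of the proposed API.

```javascript
// Hypothetical checkpoint shape; field names follow the issue text, not a confirmed schema.
function makeCheckpoint(state) {
  return {
    epoch: state.epoch,
    step: state.step,
    batchSize: state.batchSize,
    modelParams: state.modelParams.slice(),
  };
}

// One possible user-defined serialization: plain JSON.
const serialize = (checkpoint) => JSON.stringify(checkpoint);
const unserialize = (text) => JSON.parse(text);

// Toy training loop: each step adds 1 to every parameter.
// maxSteps (hypothetical) caps the steps run in this call to imitate training.stop().
function train({ epochs, stepsPerEpoch, batchSize, checkpoint, maxSteps = Infinity }) {
  // Restore state from the checkpoint when one is supplied, otherwise start fresh.
  const state = checkpoint
    ? { ...checkpoint, modelParams: checkpoint.modelParams.slice() }
    : { epoch: 0, step: 0, batchSize, modelParams: [0, 0] };
  let executed = 0;
  while (state.epoch < epochs) {
    while (state.step < stepsPerEpoch) {
      if (executed >= maxSteps) return makeCheckpoint(state);
      state.modelParams = state.modelParams.map((p) => p + 1);
      state.step += 1;
      executed += 1;
    }
    state.step = 0;
    state.epoch += 1;
  }
  return makeCheckpoint(state);
}

const config = { epochs: 2, stepsPerEpoch: 3, batchSize: 32 };
const full = train(config);                         // uninterrupted run
const partial = train({ ...config, maxSteps: 4 });  // stopped after four steps
const restored = unserialize(serialize(partial));   // round trip through storage
const resumed = train({ ...config, checkpoint: restored });
```

Under these assumptions, the stopped-and-resumed run ends in the same state as the uninterrupted run. Real model parameters are unlikely to be plain arrays, so an actual serializer would probably need to handle tensors explicitly.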
What alternatives have you considered?
The API was discussed within the FL team.
Additional Context
See #172