OpenMined / SwiftSyft

The official Syft worker for iOS, built in Swift
Apache License 2.0

Training stop/resume and checkpointing #173

Open vvmnnnkv opened 4 years ago

vvmnnnkv commented 4 years ago

Feature Description

With the Training API (#172) in place, we can add the ability to stop training and save the intermediate training state so that training can be resumed later.

// Start the training
// Training object would contain current epoch, batch, modelParameters
training = Job.train(...)

Suggested API:

// Stop training
training.stop()

New event emitted by Job.train: 'stop'
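
To illustrate how stop() and the 'stop' event could fit together, here is a minimal sketch. The `Training` class, `onStop`, and `run` are hypothetical names made up for this example and are not part of SwiftSyft; the loop is a simplified synchronous stand-in for the real training loop and assumes a stop request is honoured at the next batch boundary.

```swift
import Foundation

// Hypothetical handle of the kind Job.train might return (not the SwiftSyft API).
final class Training {
    private(set) var epoch = 0
    private(set) var batch = 0
    private var stopRequested = false
    private var stopHandlers: [(Training) -> Void] = []

    /// Registers a listener for the proposed 'stop' event.
    func onStop(_ handler: @escaping (Training) -> Void) {
        stopHandlers.append(handler)
    }

    /// Requests a stop; the loop checks the flag before each batch.
    func stop() {
        stopRequested = true
    }

    /// Simplified loop: `step` stands in for executing the plan on one batch.
    func run(epochs: Int, batchesPerEpoch: Int, step: (Training) -> Void) {
        while epoch < epochs {
            while batch < batchesPerEpoch {
                if stopRequested {
                    // epoch/batch now point at the next batch to run on resume.
                    stopHandlers.forEach { $0(self) }
                    return
                }
                step(self)
                batch += 1
            }
            batch = 0
            epoch += 1
        }
    }
}

// Usage: request a stop while processing epoch 1, batch 1.
let training = Training()
var stoppedAt: (epoch: Int, batch: Int)? = nil
training.onStop { t in stoppedAt = (t.epoch, t.batch) }
training.run(epochs: 3, batchesPerEpoch: 4) { t in
    if t.epoch == 1 && t.batch == 1 { t.stop() }
}
```

Checking the flag only between batches keeps the saved position consistent: the recorded epoch and batch always refer to work that has not started yet.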

// User-defined serialization (serialize/unserialize/storage is up to user)
serialized_checkpoint = serialize(training)
unserialized_checkpoint = unserialize(serialized_checkpoint)
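
Since serialization is left to the user, one possible approach in Swift is `Codable` with JSON. The sketch below assumes a hypothetical `Checkpoint` struct whose fields mirror the properties named in this issue; the actual shape of the training object is not defined here, and model parameters are represented as plain nested `Float` arrays purely for illustration.

```swift
import Foundation

// Hypothetical checkpoint shape; field names follow the issue text only.
struct Checkpoint: Codable, Equatable {
    var epoch: Int
    var step: Int
    var batchSize: Int
    var modelParameters: [[Float]]
}

/// User-defined serialization: encode the checkpoint as JSON data.
func serialize(_ checkpoint: Checkpoint) throws -> Data {
    try JSONEncoder().encode(checkpoint)
}

/// User-defined deserialization: decode a checkpoint from JSON data.
func unserialize(_ data: Data) throws -> Checkpoint {
    try JSONDecoder().decode(Checkpoint.self, from: data)
}

// Round trip; where the data is stored in between is up to the user.
let checkpoint = Checkpoint(epoch: 2, step: 37, batchSize: 64,
                            modelParameters: [[0.5, -1.25], [3.0]])
let data = try serialize(checkpoint)
let restored = try unserialize(data)
```

JSON is convenient for debugging; for large models a binary format would likely be a better fit.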

// Supplying checkpoint back to Job.train
training = Job.train(trainingPlan, {
   ...
   checkpoint: unserialized_checkpoint
})

The training loop should read the checkpoint's properties and restore the model params, epoch, step, batchSize, etc. from it.
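
A rough sketch of a loop that resumes from a checkpoint, under stated assumptions: `Checkpoint`, `train`, and the `update` closure are hypothetical names for this example, not the SwiftSyft API, and `update` stands in for executing the training plan on one batch.

```swift
// Hypothetical checkpoint shape; field names follow the issue text only.
struct Checkpoint {
    var epoch: Int
    var step: Int
    var batchSize: Int
    var modelParameters: [Float]
}

/// Runs the loop from the checkpoint if one is supplied, otherwise from scratch.
func train(epochs: Int,
           stepsPerEpoch: Int,
           initialParameters: [Float],
           batchSize: Int,
           checkpoint: Checkpoint? = nil,
           update: (inout [Float], Int, Int) -> Void) -> Checkpoint {
    // Restore params, epoch, step and batchSize from the checkpoint when present.
    var state = checkpoint ?? Checkpoint(epoch: 0, step: 0,
                                         batchSize: batchSize,
                                         modelParameters: initialParameters)
    while state.epoch < epochs {
        while state.step < stepsPerEpoch {
            let (epoch, step) = (state.epoch, state.step)
            update(&state.modelParameters, epoch, step)
            state.step += 1
        }
        state.step = 0
        state.epoch += 1
    }
    return state
}

// Usage: resume at epoch 1, step 2 of a 2-epoch, 3-steps-per-epoch run.
var visited: [String] = []
let saved = Checkpoint(epoch: 1, step: 2, batchSize: 32, modelParameters: [0.0])
let finished = train(epochs: 2, stepsPerEpoch: 3,
                     initialParameters: [0.0], batchSize: 32,
                     checkpoint: saved) { params, epoch, step in
    params[0] += 1
    visited.append("\(epoch):\(step)")
}
```

In this sketch only the single remaining step of the last epoch runs, so no completed work is repeated after a resume.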

What alternatives have you considered?

The API was discussed within the FL team.

Additional Context

See #172