OpenMined / KotlinSyft

The official Syft worker for secure on-device machine learning
https://www.openmined.org
Apache License 2.0

Training stop/resume and checkpointing #265

Open vvmnnnkv opened 4 years ago

Feature Description

With the Training API (#264) in place, we can add the ability to stop training and save the intermediate training state so that training can be resumed later.

// Start the training
// The training object would contain the current epoch, batch, and modelParameters
training = Job.train(...)

Suggested API:

// Stop training
training.stop()

New event emitted by Job.train: 'stop'
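
To make the stop semantics concrete, here is a minimal Kotlin sketch of a cooperative stop with a 'stop' listener. `TrainingSketch`, `onStop`, and `runLoop` are hypothetical names used only for illustration; they are not part of the KotlinSyft API, and the real training object from #264 may look quite different.

```kotlin
import java.util.concurrent.atomic.AtomicBoolean

// Hypothetical stand-in for the object returned by Job.train(...); not the KotlinSyft API.
class TrainingSketch(private val totalSteps: Int) {
    private val stopRequested = AtomicBoolean(false)
    private val stopListeners = mutableListOf<(Int) -> Unit>()

    // Number of steps completed so far.
    var step = 0
        private set

    // Register a listener for the proposed 'stop' event; it receives the last completed step.
    fun onStop(listener: (Int) -> Unit) {
        stopListeners += listener
    }

    // Request a cooperative stop; the loop checks the flag before starting each step.
    fun stop() {
        stopRequested.set(true)
    }

    // Run until all steps are done or a stop has been requested.
    // onStep stands in for one real training step and receives the completed step count.
    fun runLoop(onStep: (Int) -> Unit = {}) {
        while (step < totalSteps) {
            if (stopRequested.get()) {
                stopListeners.forEach { it(step) }
                return
            }
            step++
            onStep(step)
        }
    }
}
```

In this sketch the stop is only honored between steps, so a step that has already started always finishes. That keeps the state captured at stop time consistent, at the cost of a short delay before the 'stop' event fires.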

// User-defined serialization (serialize/unserialize/storage is up to user)
serialized_checkpoint = serialize(training)
unserialized_checkpoint = unserialize(serialized_checkpoint)

// Supplying checkpoint back to Job.train
training = Job.train(trainingPlan, {
   ...
   checkpoint: unserialized_checkpoint
})

The training loop should read the checkpoint's properties and restore the model params, epoch, step, batchSize, etc. from it.
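
The resume behaviour described above could look roughly like the following Kotlin sketch. The `Checkpoint` class, the text-based `serialize`/`unserialize` pair, the `train` signature, and the `stopAfterSteps` parameter are all assumptions made for illustration. The "training step" is a toy update (add 1 to every parameter) rather than a real optimizer step. Here a checkpoint records the next epoch and step to execute.

```kotlin
// Hypothetical checkpoint shape; field names follow the issue text, the class itself is an assumption.
data class Checkpoint(
    val epoch: Int,               // epoch to resume in
    val step: Int,                // next step to run within that epoch
    val batchSize: Int,
    val modelParams: List<Float>
)

// User-defined serialization, as the issue proposes; a plain text format is used here for brevity.
fun serialize(c: Checkpoint): String =
    listOf(c.epoch, c.step, c.batchSize, c.modelParams.joinToString(",")).joinToString(";")

fun unserialize(s: String): Checkpoint {
    val parts = s.split(";")
    val params = if (parts[3].isEmpty()) emptyList() else parts[3].split(",").map { it.toFloat() }
    return Checkpoint(parts[0].toInt(), parts[1].toInt(), parts[2].toInt(), params)
}

// Toy loop: resumes from the checkpoint if one is given, otherwise starts from scratch.
// stopAfterSteps simulates a stop request after that many steps in this call.
fun train(
    epochs: Int,
    stepsPerEpoch: Int,
    batchSize: Int,
    initialParams: List<Float>,
    checkpoint: Checkpoint? = null,
    stopAfterSteps: Int = Int.MAX_VALUE
): Checkpoint {
    var params = checkpoint?.modelParams ?: initialParams
    val effectiveBatchSize = checkpoint?.batchSize ?: batchSize
    val startEpoch = checkpoint?.epoch ?: 0
    var stepsRun = 0
    for (epoch in startEpoch until epochs) {
        // Only the first resumed epoch starts mid-way; later epochs start at step 0.
        val firstStep = if (epoch == startEpoch) (checkpoint?.step ?: 0) else 0
        for (step in firstStep until stepsPerEpoch) {
            if (stepsRun == stopAfterSteps) {
                return Checkpoint(epoch, step, effectiveBatchSize, params)
            }
            params = params.map { it + 1f } // placeholder for a real training step
            stepsRun++
        }
    }
    // Training finished: the returned checkpoint points past the last epoch.
    return Checkpoint(epochs, 0, effectiveBatchSize, params)
}
```

A useful property to aim for, and one this sketch satisfies for its toy update, is that stopping, serializing, unserializing, and resuming ends in the same state as an uninterrupted run. A real implementation would also have to restore anything else that affects the result, such as optimizer state or data shuffling order, which the issue leaves under "etc.".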

What alternatives have you considered?

This API was discussed within the FL team.

Additional Context

See #264