Update status in the checkpoint when training hit the last epoch/episode

IBM / mi-prometheus

Enabling reproducible Machine Learning research

http://mi-prometheus.rtfd.io/

Apache License 2.0


Update status in the checkpoint when training hit the last epoch/episode #85

Closed tkornuta-ibm closed 5 years ago

tkornuta-ibm commented 5 years ago

In the case when:

the model is not better than the previously saved model, AND
the status changes (which happens ONLY when the episode or epoch limit is reached)

Then we should:

load data from checkpoint
update status to the current status, leaving the model unchanged (?)
save the updated checkpoint

This requires additionally saving the training status along with the loss in the model.
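The steps above could be sketched roughly as follows. This is only an illustration, not the project's implementation: the function name `save_or_update_status`, the dictionary keys (`model_state`, `loss`, `status`) and the use of `pickle` as a stand-in for the real checkpoint serialization are all assumptions.

```python
import pickle
from pathlib import Path


def save_or_update_status(path, model_state, loss, status):
    """Save a better model, or only refresh the status of the stored one.

    Hypothetical helper: key names and pickle serialization are assumptions.
    Returns "saved", "status_updated" or "unchanged".
    """
    path = Path(path)
    new_checkpoint = {"model_state": model_state, "loss": loss, "status": status}

    # No checkpoint yet: store the current model together with its status.
    if not path.exists():
        with open(path, "wb") as f:
            pickle.dump(new_checkpoint, f)
        return "saved"

    # Load data from the existing checkpoint.
    with open(path, "rb") as f:
        checkpoint = pickle.load(f)

    # The current model is better: overwrite the whole checkpoint.
    if loss < checkpoint["loss"]:
        with open(path, "wb") as f:
            pickle.dump(new_checkpoint, f)
        return "saved"

    # The model is not better, but the status changed:
    # update only the status and leave the stored model and loss untouched.
    if status != checkpoint["status"]:
        checkpoint["status"] = status
        with open(path, "wb") as f:
            pickle.dump(checkpoint, f)
        return "status_updated"

    return "unchanged"
```

Storing the status next to the loss is what makes the last branch possible: without it, there would be nothing to compare the current status against.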

vmarois commented 5 years ago

Yes :+1: This is useful when the best_model (based on the validation loss criterion) is saved early in training but the loss never goes below the loss threshold, so the model has not actually "converged". In that case, model_saved_timestamp and terminal_status_update_timestamp will help the user understand what happened.
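As a rough illustration of how the two timestamps could be read together, here is a small sketch. It assumes the checkpoint is a dictionary that stores both fields as `datetime` objects plus a `status` string; the helper name `describe_checkpoint` and the example dates are made up.

```python
from datetime import datetime


def describe_checkpoint(checkpoint):
    """Explain when the model was saved and when its status was last updated.

    Hypothetical helper: assumes both timestamps are datetime objects.
    """
    saved = checkpoint["model_saved_timestamp"]
    updated = checkpoint["terminal_status_update_timestamp"]
    status = checkpoint["status"]

    if updated > saved:
        # The status was written after the model itself: training kept going
        # without producing a better model, then hit a terminal condition.
        return (
            f"Best model saved at {saved:%Y-%m-%d %H:%M}; "
            f"training ended at {updated:%Y-%m-%d %H:%M} with status: {status}"
        )
    return f"Model saved at {saved:%Y-%m-%d %H:%M} with status: {status}"


# Illustrative values only.
example = {
    "model_saved_timestamp": datetime(2018, 11, 1, 10, 0),
    "terminal_status_update_timestamp": datetime(2018, 11, 2, 18, 30),
    "status": "Not converged (Epoch Limit reached)",
}
message = describe_checkpoint(example)
```

A gap between the two timestamps tells the user that the stored weights come from an early point in training, not from the moment training stopped.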

tkornuta-ibm commented 5 years ago

List of possible statuses (for now):

Converged (Full Validation Loss went below Loss Stop threshold) (offline trainer)
Converged (Partial Validation Loss went below Loss Stop threshold) (online trainer)
Not converged (Episode Limit reached)
Not converged (Epoch Limit reached)
Not converged

The last one means:

Not converged (Interrupted by the user) or (Interrupted for another reason). I guess that is enough for now.
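One way to encode the statuses listed above is an enumeration plus a small selection function. This is a sketch under assumptions: the names `TrainingStatus` and `terminal_status`, the parameter names, and the convention that a limit counts as reached when the counter is greater than or equal to it are not taken from the project.

```python
from enum import Enum


class TrainingStatus(Enum):
    """Status strings from the list above; the member names are hypothetical."""
    CONVERGED_FULL = "Converged (Full Validation Loss went below Loss Stop threshold)"
    CONVERGED_PARTIAL = "Converged (Partial Validation Loss went below Loss Stop threshold)"
    EPISODE_LIMIT = "Not converged (Episode Limit reached)"
    EPOCH_LIMIT = "Not converged (Epoch Limit reached)"
    NOT_CONVERGED = "Not converged"


def terminal_status(validation_loss, loss_stop, full_validation,
                    episode=None, episode_limit=None,
                    epoch=None, epoch_limit=None):
    """Pick the status describing how training ended (assumed logic)."""
    # Convergence: the validation loss went below the loss stop threshold.
    if validation_loss < loss_stop:
        if full_validation:
            return TrainingStatus.CONVERGED_FULL      # offline trainer
        return TrainingStatus.CONVERGED_PARTIAL       # online trainer

    # Limits: ">=" as the "limit reached" test is an assumption.
    if epoch is not None and epoch_limit is not None and epoch >= epoch_limit:
        return TrainingStatus.EPOCH_LIMIT
    if episode is not None and episode_limit is not None and episode >= episode_limit:
        return TrainingStatus.EPISODE_LIMIT

    # Anything else, e.g. interrupted by the user or for another reason.
    return TrainingStatus.NOT_CONVERGED
```

Keeping the human-readable string as the enum value means the same text can be written to the checkpoint and shown to the user.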