Add config option `initial_epoch` to restore model checkpoint and position in the LR scheduler

dbuscombe-usgs commented 1 year ago

When things go awry during model training (for example a power cut, memory leak, or other unexpected issue that interrupts a lengthy run), there is currently no way to resume training from where it stopped.

HOT_START could be used to restore model weights and restart training from the first epoch; however, the LR scheduler would also start again from the beginning, which negates the point of the LR scheduler. In fact, restarting the model with refined weights without adjusting the LR scheduler could cause unwanted convergence issues.
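To illustrate the problem, here is a minimal sketch assuming a hypothetical exponential-decay schedule. The function name `lr_schedule` and its constants are illustrative and are not taken from Gym's code. Keras' `LearningRateScheduler` callback calls its schedule function with the current 0-based epoch index, so a restart at epoch 0 puts already-refined weights back on the largest learning rate:

```python
# Hypothetical schedule: the constants are illustrative, not Gym's actual values.
START_LR = 1e-3
MIN_LR = 1e-6
DECAY = 0.9


def lr_schedule(epoch, lr=None):
    """Learning rate for a 0-based epoch index (exponential decay with a floor).

    The (epoch, lr) signature matches what Keras' LearningRateScheduler
    passes to its schedule function; the current lr is unused here.
    """
    return max(MIN_LR, START_LR * DECAY ** epoch)


# Suppose a run is interrupted after 40 completed epochs.
lr_if_restarted = lr_schedule(0)   # scheduler restarts: back to START_LR
lr_if_resumed = lr_schedule(40)    # scheduler continues from its old position

print(f"restart at epoch 0: {lr_if_restarted:.2e}")
print(f"resume at epoch 40: {lr_if_resumed:.2e}")
```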

To avoid this situation, the code could be modified as follows:

Add a new parameter, INITIAL_EPOCH, to the config file
If absent, it defaults to zero
model.fit uses INITIAL_EPOCH as the argument to its initial_epoch parameter
If HOT_START is specified but INITIAL_EPOCH is not, the program should exit with a message for the user
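The steps above could be sketched roughly as follows. This is not the code from the eventual commit: it assumes the config file has already been loaded into a plain dict, the helper name `resolve_initial_epoch` is hypothetical, and the last rule is read as "HOT_START is set but INITIAL_EPOCH is missing", which is an interpretation of the proposal.

```python
def resolve_initial_epoch(config):
    """Return the epoch index to start from, based on a loaded config dict.

    Assumptions (not confirmed by the issue): `config` is a dict, and
    HOT_START holds a path to previously saved weights.
    """
    hot_start = config.get("HOT_START")
    if hot_start and "INITIAL_EPOCH" not in config:
        # Exit with a message rather than silently restarting the LR schedule.
        raise SystemExit(
            "HOT_START is set but INITIAL_EPOCH is missing from the config; "
            "set INITIAL_EPOCH to the epoch at which training should resume."
        )
    # Absent INITIAL_EPOCH (and no HOT_START) means a fresh run from epoch 0.
    return int(config.get("INITIAL_EPOCH", 0))


# Fresh run: no HOT_START, no INITIAL_EPOCH.
fresh = resolve_initial_epoch({})

# Resumed run: the weights path is a made-up example.
resumed = resolve_initial_epoch(
    {"HOT_START": "example_weights.h5", "INITIAL_EPOCH": 40}
)

print(fresh, resumed)

# The value would then be handed to Keras, along the lines of:
#   model.fit(train_ds, epochs=total_epochs, initial_epoch=resumed, ...)
# where `train_ds` and `total_epochs` are placeholders.
```

Note that when `initial_epoch` is given, Keras treats `epochs` as the index of the final epoch rather than as the number of additional epochs to run, so the total in the config would stay unchanged when resuming.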

dbuscombe-usgs commented 1 year ago

The Keras model.fit options, including a description of initial_epoch, are listed here: https://keras.io/api/models/model_training_apis/

should be a straightforward fix

dbuscombe-usgs commented 1 year ago

The one downside I see is that the full training history for the model, currently provided in the output file ..model_history.npz, would not be available, because that file is only created after model training completes successfully. I do not see a workaround, however.

dbuscombe-usgs commented 1 year ago

Implemented in https://github.com/Doodleverse/segmentation_gym/commit/809466a3edf097674504fc8847f82ffc70cdc2fa

Leaving open to add to wiki docs

dbuscombe-usgs commented 1 year ago

now added to wiki

closing

Doodleverse / segmentation_gym

Add config option `initial_epoch` to restore model checkpoint and position in the LR scheduler #110