Doodleverse / segmentation_gym

A neural gym for training deep learning models to carry out geoscientific image segmentation. Works best with labels generated using https://github.com/Doodleverse/dash_doodler
MIT License

Add config option `initial_epoch` to restore model checkpoint and position in the LR scheduler #110

Closed dbuscombe-usgs closed 1 year ago

dbuscombe-usgs commented 1 year ago

When model training goes awry, for example because of a power cut, memory leak, or other unexpected issue that interrupts a lengthy run, there is currently no way to resume training from where it stopped.

HOT_START could be used to restore the model weights and resume training, but training would restart at the first epoch, so the LR scheduler would also start again from the beginning, negating the point of the LR scheduler. In fact, restarting the model with refined weights without adjusting the LR scheduler could cause unwanted convergence issues.
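To illustrate the problem, here is a minimal sketch of an epoch-indexed schedule. The function `lr_schedule` and all of its default values are hypothetical and are not taken from the segmentation_gym code; they only stand in for the kind of ramp-up-then-decay function that can be passed to `tf.keras.callbacks.LearningRateScheduler`, which calls the schedule with the current epoch index.

```python
def lr_schedule(epoch, start_lr=1e-7, max_lr=1e-4, rampup_epochs=20, decay=0.9):
    """Hypothetical schedule: linear ramp-up, then exponential decay.

    All default values are illustrative assumptions.
    """
    if epoch < rampup_epochs:
        # Linear ramp from start_lr towards max_lr
        return start_lr + (max_lr - start_lr) * epoch / rampup_epochs
    # Exponential decay once the ramp-up is complete
    return max_lr * decay ** (epoch - rampup_epochs)


# Training was interrupted after 50 epochs.
lr_if_resumed = lr_schedule(50)    # rate the schedule intends for epoch 50
lr_if_restarted = lr_schedule(0)   # rate actually used if the epoch count resets

# A restart also replays the ramp-up, so after a few epochs the refined
# weights are trained at a much larger rate than the schedule intended.
lr_after_restart_9 = lr_schedule(9)
print(lr_if_resumed, lr_if_restarted, lr_after_restart_9)
```

With these assumed values, a restarted run climbs back through the ramp-up phase and applies a larger learning rate to already-refined weights than a resumed run would at epoch 50.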

To avoid this situation, the code could be modified as follows:

  1. Add a new parameter, INITIAL_EPOCH, to the config file
  2. If absent, it would default to zero
  3. model.fit would pass INITIAL_EPOCH as the argument to its initial_epoch parameter
  4. If HOT_START is specified but INITIAL_EPOCH is not, the program should exit with a message for the user
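
The steps above could be sketched roughly as follows. This is not the code from the repository: the helper `resolve_initial_epoch` is hypothetical, the config is assumed to be a plain dict loaded from the config file, and step 4 is read as "HOT_START is given but INITIAL_EPOCH is missing". Only the `initial_epoch` argument of Keras' `model.fit` is taken from the Keras API.

```python
import sys


def resolve_initial_epoch(config):
    """Return the epoch index at which training should start.

    Hypothetical helper; `config` is assumed to be a dict read from the
    config file.
    """
    has_hot_start = bool(config.get("HOT_START"))

    if "INITIAL_EPOCH" not in config:
        if has_hot_start:
            # Step 4 (as interpreted here): refuse to resume without a position
            sys.exit("HOT_START is set but INITIAL_EPOCH is missing from the config file.")
        # Step 2: default to zero when the parameter is absent
        return 0

    initial_epoch = int(config["INITIAL_EPOCH"])
    if initial_epoch < 0:
        sys.exit("INITIAL_EPOCH must be zero or a positive integer.")
    return initial_epoch


# Step 3: the value would be handed to model.fit, for example
# (model, train_ds, and max_epochs are placeholders, not repository names):
#
#   model.fit(train_ds, epochs=max_epochs,
#             initial_epoch=resolve_initial_epoch(config))
```

Because Keras counts epochs from `initial_epoch`, any epoch-indexed callback such as a learning rate scheduler would then continue from that position rather than from zero.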

dbuscombe-usgs commented 1 year ago

The Keras model.fit options listed at https://keras.io/api/models/model_training_apis/ include a description of initial_epoch.

should be a straightforward fix

dbuscombe-usgs commented 1 year ago

The one downside I see is that the full training history for the model, currently provided in the output file ..model_history.npz, would not be available. That file is only created after training finishes successfully. I do not see a workaround, however.

dbuscombe-usgs commented 1 year ago

Implemented in https://github.com/Doodleverse/segmentation_gym/commit/809466a3edf097674504fc8847f82ffc70cdc2fa

Leaving open to add to wiki docs

dbuscombe-usgs commented 1 year ago

now added to wiki

closing