remotejob opened 5 years ago
Do you mean computing the validation loss for the last K epochs and saving only the best of them? Or saving the checkpoints for all of the last K epochs?
As I understand it,
"save_checkpoint_steps": 20001,
will save the model only after 20001 steps?
But it would be nice to have an option to save only the best model and discard (delete) the previous ones.
I am trying to experiment with your project on https://www.kaggle.com/,
but Kaggle has a disk space limit of 5.2 GB.
Other toolkits, for example OpenNMT-py, have a -keep_checkpoint option: with -keep_checkpoint 1 only the last checkpoint is kept.
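Until a framework offers such an option, a "keep only the last K checkpoints" policy can be approximated by pruning the checkpoint directory after each save. The sketch below is a hypothetical helper, not part of OpenSeq2Seq or OpenNMT-py. It assumes one file per checkpoint named `model-<step>.ckpt`; real frameworks (TensorFlow, for example) write several files per checkpoint, so the pattern would need adapting.

```python
import re
from pathlib import Path

# Hypothetical naming scheme: one file per checkpoint, "model-<step>.ckpt".
CKPT_PATTERN = re.compile(r"model-(\d+)\.ckpt$")


def prune_checkpoints(ckpt_dir: str, keep: int) -> list[Path]:
    """Delete all but the `keep` most recent checkpoints; return the deleted paths."""
    ckpts = []
    for path in Path(ckpt_dir).iterdir():
        match = CKPT_PATTERN.match(path.name)
        if match:  # ignore files that are not checkpoints
            ckpts.append((int(match.group(1)), path))
    # Sort numerically by training step, newest first.
    ckpts.sort(key=lambda item: item[0], reverse=True)
    deleted = []
    for _, path in ckpts[keep:]:
        path.unlink()
        deleted.append(path)
    return deleted
```

Sorting on the parsed integer step (rather than on the file name) keeps `model-1000.ckpt` newer than `model-300.ckpt`.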
To address the storage problem you can use this parameter:
'num_checkpoints': int,  # maximum number of last checkpoints to keep
We also support another useful flag, which restores the best saved checkpoint:
'restore_best_checkpoint': bool,  # if True, restore the best checkpoint instead of the latest one
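Put together, the checkpoint-related part of a config might look like the excerpt below. This is a sketch that assumes the config file defines a `base_params` dict, as the OpenSeq2Seq example configs typically do; only the three keys mentioned in this thread are shown, the values are illustrative, and the exact dict in which `restore_best_checkpoint` belongs should be checked against the OpenSeq2Seq documentation.

```python
# Hypothetical excerpt of an OpenSeq2Seq config file (checkpoint keys only).
base_params = {
    "save_checkpoint_steps": 20001,   # how often, in steps, to write a checkpoint
    "num_checkpoints": 1,             # maximum number of last checkpoints to keep
    "restore_best_checkpoint": True,  # restore the best checkpoint instead of the latest
    # ... model, data and optimizer settings omitted ...
}
```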
Currently, OpenSeq2Seq stores the N last checkpoints (written at the frequency set by "save_checkpoint_steps", always including the checkpoint from the final step) and the N best checkpoints (in the best_models directory). N can be set in the config file. For example, if you add "num_checkpoints": 1, it will keep only two checkpoints: the last one and the best one.
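The retention rule described above (keep the N most recent checkpoints plus the N best ones) can be illustrated with a small selection function. This is a hypothetical sketch, not OpenSeq2Seq's implementation: it assumes each saved checkpoint is summarized by its step and a validation loss (lower is better), and that the caller includes the final-step checkpoint in the list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Checkpoint:
    step: int        # training step at which the checkpoint was written
    val_loss: float  # hypothetical validation loss; lower is better


def retained(checkpoints: list[Checkpoint], n: int) -> tuple[list[Checkpoint], list[Checkpoint]]:
    """Return (last_n, best_n): the n most recent and the n lowest-loss checkpoints."""
    # Guard n == 0: a slice of [-0:] would return the whole list.
    last_n = sorted(checkpoints, key=lambda c: c.step)[-n:] if n > 0 else []
    best_n = sorted(checkpoints, key=lambda c: c.val_loss)[:n]
    return last_n, best_n


history = [Checkpoint(1000, 2.1), Checkpoint(2000, 1.7),
           Checkpoint(3000, 1.9), Checkpoint(4000, 1.8)]
last, best = retained(history, 1)
print([c.step for c in last], [c.step for c in best])  # [4000] [2000]
```

With n = 1 the last checkpoint (step 4000) and the best one (step 2000) are both kept, which matches the "two checkpoints only" behaviour described for "num_checkpoints": 1.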
Is it possible to save only the best or the last model during training?