After revisiting this: `checkpoint_freq` and `min_perf_metric` aren't ideally compatible. If the goal is to use `min_perf_metric` to decide when to checkpoint, then there should be a warmup rather than a frequency. I've refactored this, replacing `checkpoint_freq` with `checkpoint_warmup`, which sets a warmup period before checkpointing begins. Two print lines are also included.
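For reference, a minimal sketch of the intended logic, assuming a PyTorch-style training loop; `checkpoint_warmup` and `min_perf_metric` mirror the names discussed above, while the function name, the `state` dict, and the file path are illustrative placeholders, not the actual implementation:

```python
import torch

def maybe_checkpoint(model, epoch, val_loss, state,
                     checkpoint_warmup=10, path="checkpoint.pk"):
    """Save the model whenever the synchronized validation loss improves,
    but only after a warmup period, so the noisy early epochs do not
    trigger a flood of saves."""
    if epoch < checkpoint_warmup:
        return False  # still warming up; never checkpoint here
    if val_loss < state["min_perf_metric"]:
        state["min_perf_metric"] = val_loss
        torch.save(model.state_dict(), path)
        print(f"Checkpoint at epoch {epoch}: val_loss={val_loss:.6f}")
        return True
    return False

# Usage: initialize once before the loop, then call every epoch.
# state = {"min_perf_metric": float("inf")}
```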
@jychoi-hpc
Hi Jong. Justin implemented a checkpoint/restart feature that saves a copy of the model whenever the validation loss reaches a new minimum. A warm-up period is applied during the early iterations of training to avoid saving too many models at the beginning. I was wondering whether the PR might introduce I/O bottlenecks that significantly slow down training. Any thoughts on this?
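On the slowdown concern: since saves only happen when the validation loss improves (and never during warmup), the write frequency should drop as training converges. One standard way to bound the remaining I/O cost in distributed training is to write from a single rank; a sketch of that idea, not necessarily what this PR does:

```python
import torch
import torch.distributed as dist

def save_on_rank0(model, path="checkpoint.pk"):
    """Write the checkpoint from one process only, so every rank is not
    hitting the filesystem at once during distributed training."""
    if not dist.is_initialized() or dist.get_rank() == 0:
        torch.save(model.state_dict(), path)
    if dist.is_initialized():
        dist.barrier()  # keep ranks in sync after the (possibly slow) write
```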
Basic checkpointing for training based on the synchronized `validation_loss` performance metric. Can be turned on by adding an additional `checkpoint_freq` argument for the frequency of checkpointing (default=10).
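A hypothetical call site for the original `checkpoint_freq` interface; the function name and the `checkpoint` on-switch are placeholders I'm assuming for illustration, and only `checkpoint_freq` and its default of 10 come from the description above:

```python
# Placeholder names: `run_training` and `checkpoint` are assumptions,
# only `checkpoint_freq` (default=10) is taken from the PR description.
run_training(
    model,
    train_loader,
    val_loader,
    checkpoint=True,     # assumed on-switch for checkpointing
    checkpoint_freq=10,  # how often (in epochs) checkpointing is considered
)
```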