Closed dwro0121 closed 2 years ago
Hello @dwro0121,
To clarify:
- `last_checkpoint.yaml`: at the end of each epoch, save the current checkpoint as `latest-{epoch}.ckpt` (and delete the previous one). At the end of training, save the last one as `last.ckpt`.
- `latest_checkpoint`: every X epochs (200 by default), save the checkpoint (and keep all the others).

Actually, about the config files, it does not matter so much: the default behaviour with `monitor: None` is to save the last checkpoint.
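As a rough illustration of the two retention policies described above, here is a toy simulation. It is not the repository's code: the function name `simulate`, the 1-based epoch numbering, and the `epoch-{n}.ckpt` file names for the periodic checkpoints are assumptions made for the example.

```python
def simulate(num_epochs, every_n_epochs=200):
    """Return the checkpoint files each policy leaves on disk after training.

    Toy model only: file names for the periodic policy are hypothetical.
    """
    last_policy = set()    # keeps only the newest per-epoch file, plus last.ckpt
    periodic_policy = []   # keeps every periodic file, never deletes

    for epoch in range(1, num_epochs + 1):
        # Policy 1: save after every epoch and drop the previous file.
        last_policy = {f"latest-{epoch}.ckpt"}
        # Policy 2: save only every `every_n_epochs` epochs and keep all of them.
        if epoch % every_n_epochs == 0:
            periodic_policy.append(f"epoch-{epoch}.ckpt")

    # End of training: policy 1 also writes last.ckpt.
    last_policy.add("last.ckpt")
    return sorted(last_policy), periodic_policy


if __name__ == "__main__":
    kept_last, kept_periodic = simulate(600)
    print(kept_last)
    print(kept_periodic)
```

With 600 epochs and the default interval of 200, the first policy ends with two files while the second ends with three, one per interval.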
I will remove `monitor: step` and `mode: max` from `latest_checkpoint`, which do the same thing; it will be clearer. Thanks for pointing this out to me.
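To see why monitoring the step with `mode: max` amounts to keeping the most recent checkpoint, and why `mode: max` would be wrong for a loss, here is a small sketch. The helper `best_checkpoint` and the sample records are hypothetical and only mimic the selection rule; they are not taken from the repository or from any library.

```python
def best_checkpoint(records, monitor, mode):
    """Pick the checkpoint whose monitored value is largest ("max") or smallest ("min")."""
    if mode not in ("max", "min"):
        raise ValueError(f"mode must be 'max' or 'min', got {mode!r}")
    pick = max if mode == "max" else min
    return pick(records, key=lambda r: r[monitor])["name"]


# Made-up training history for illustration only.
records = [
    {"name": "a.ckpt", "step": 100, "val_loss": 0.9},
    {"name": "b.ckpt", "step": 200, "val_loss": 0.4},
    {"name": "c.ckpt", "step": 300, "val_loss": 0.6},
]

if __name__ == "__main__":
    # The step only grows, so "max" always selects the newest checkpoint.
    print(best_checkpoint(records, "step", "max"))
    # With a loss, "max" selects the worst checkpoint; "min" selects the best.
    print(best_checkpoint(records, "val_loss", "max"))
    print(best_checkpoint(records, "val_loss", "min"))
```

Because the step increases monotonically, the first call always returns the latest file, which is the same outcome as not monitoring anything and saving the last checkpoint.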
To simplify things, the best model is `last.ckpt`, the checkpoint after a full training. I am not using the validation metric/loss to choose the best model.
Questions have been resolved. Thank you for the reply.
Hi, thanks for your great work. I have a question about checkpoints.
I looked at the config files and found that you used `mode=max` in `latest_checkpoint.yaml`, but I can't find it in `last_checkpoint.yaml`. So if you used the same metrics for both, I think we need to remove it from `latest_checkpoint.yaml` (if error or loss is used as the metric). What do you think about this?
Additionally, I want to know which one is the best model. Is `last.ckpt` the best model according to the metrics (validation error or loss)? Thanks.