About checkpoints - Githubissues

dwro0121 commented 2 years ago

Hi, thanks for your great work. I have a question about checkpoints.

I looked at the config files and found that you use mode=max in latest_checkpoint.yaml, but I can't find it in last_checkpoint.yaml. If you use the same metric for both, I think it should be removed from latest_checkpoint.yaml (assuming the metric is an error or a loss, where lower is better).

What do you think about this?

Additionally, I would like to know which checkpoint is the best model. Is last.ckpt the best model according to the metrics (validation error or loss)?

Thanks.

Mathux commented 2 years ago

Hello @dwro0121,

To clarify:

last_checkpoint.yaml: At the end of each epoch, save the current checkpoint as latest-{epoch}.ckpt (and delete the previous one). At the end of training, save the final one as last.ckpt.
latest_checkpoint.yaml: Every X epochs (200 by default), save a checkpoint (and keep all the others).
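
The two policies described above can be illustrated with a small simulation. This is only a sketch, not code from the repository: the function name `simulate_checkpoints`, the `periodic-{epoch}.ckpt` filename, zero-based epoch numbering, and the assumption that the rolling file stays on disk next to last.ckpt are all hypothetical choices made for illustration.

```python
def simulate_checkpoints(num_epochs, every_n_epochs=200):
    """Return the files each policy would leave on disk after training.

    Hypothetical model of the two policies: epochs are numbered from 0,
    and the periodic filename pattern is made up for this example.
    """
    rolling = set()   # policy 1: a single rolling file, replaced every epoch
    periodic = []     # policy 2: one file every `every_n_epochs`, all kept

    for epoch in range(num_epochs):
        # Policy 1: save the current epoch and drop the previous file.
        rolling = {f"latest-{epoch}.ckpt"}

        # Policy 2: save every `every_n_epochs` epochs and never delete.
        if (epoch + 1) % every_n_epochs == 0:
            periodic.append(f"periodic-{epoch}.ckpt")

    # At the end of training, policy 1 also writes last.ckpt.
    if num_epochs > 0:
        rolling.add("last.ckpt")

    return sorted(rolling), periodic


rolling_files, periodic_files = simulate_checkpoints(600, every_n_epochs=200)
print(rolling_files)
print(periodic_files)
```

Under these assumptions, the first policy keeps disk usage constant while always allowing training to resume from the most recent epoch, and the second keeps a sparse history of intermediate models.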

Actually, regarding the config files, it does not matter much: the default behaviour with monitor: None is to save the last checkpoint. I will remove monitor: step and mode: max from latest_checkpoint, since they do the same thing, which will make it clearer. Thanks for pointing this out to me.
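
To see why the two settings are equivalent, here is a simplified model of checkpoint selection. It is a sketch under stated assumptions, not the checkpoint callback's actual implementation: `select_checkpoint`, the record layout, and the numbers in `history` are hypothetical. The only fact it relies on is that the training step increases monotonically, so the record with the maximum step is always the most recent one.

```python
def select_checkpoint(history, monitor=None, mode="max"):
    """Pick one record from a chronologically ordered list of checkpoints.

    Hypothetical helper: with no monitor, the most recent record wins;
    otherwise the record with the min/max monitored value wins.
    """
    if not history:
        raise ValueError("history is empty")
    if monitor is None:
        return history[-1]
    if mode not in ("min", "max"):
        raise ValueError("mode must be 'min' or 'max'")
    choose = max if mode == "max" else min
    return choose(history, key=lambda record: record[monitor])


# Illustrative values only; they are not results from any real run.
history = [
    {"file": "epoch-0.ckpt", "step": 100, "val_loss": 0.90},
    {"file": "epoch-1.ckpt", "step": 200, "val_loss": 0.40},
    {"file": "epoch-2.ckpt", "step": 300, "val_loss": 0.55},
]

# Monitoring the step with mode="max" selects the same record as no monitor.
same = select_checkpoint(history, "step", "max") == select_checkpoint(history)
print(same)
```

The same model also shows the concern raised in the question: if the monitored value were a loss, mode="max" would select the worst checkpoint, so a loss-like metric would need mode="min".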

To keep things simple, the best model is last.ckpt, the checkpoint saved after a full training run. I am not using the validation metric or loss to choose the best model.

dwro0121 commented 2 years ago

My questions have been resolved. Thank you for the reply.

Mathux / TEMOS

About checkpoints #5