CUNY-CL / yoyodyne

Small-vocabulary sequence-to-sequence generation with optional feature conditioning
Apache License 2.0

Add `save_best` flag to control checkpointing #169

Closed: michaelpginn closed this 2 months ago

michaelpginn commented 2 months ago

There are cases where the user may not want to save checkpoints based on dev accuracy. For example, if the training process has very low or unstable accuracy, using this metric to select checkpoints can result in a suboptimal final model.

This PR simply adds a flag (`--save_best` and `--no_save_best`) to enable naive saving, where a new checkpoint is saved every epoch. The default behavior is unchanged, and the naming convention follows the other flags.
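
As a rough illustration of the two behaviors (a sketch, not the PR's actual implementation; the metric name `val_accuracy` and the helper function are assumptions), the flag could map onto Lightning's `ModelCheckpoint` along these lines:

```python
from pytorch_lightning.callbacks import ModelCheckpoint


def make_checkpoint_callback(save_best: bool) -> ModelCheckpoint:
    """Builds the checkpointing callback for either behavior.

    save_best=True keeps only the checkpoint with the best dev accuracy;
    save_best=False naively writes a checkpoint at the end of every epoch.
    """
    if save_best:
        # Track validation accuracy and keep only the single best checkpoint.
        return ModelCheckpoint(monitor="val_accuracy", mode="max", save_top_k=1)
    # Naive saving: keep every epoch's checkpoint regardless of metrics.
    return ModelCheckpoint(save_top_k=-1, every_n_epochs=1)
```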

michaelpginn commented 2 months ago

Hmm, the black formatter and the flake8 rules seem to conflict over whether long string lines should be broken. What's your preference?

Adamits commented 2 months ago

Quick question here:

Lightning lets you evaluate/log either every n epochs or every n steps, IIRC. It might be logical to expect `--save_all` to save a checkpoint each time we run evaluation. Might there be a way of configuring the callback to do this instead?
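
For illustration only (the interval below is a placeholder, not a yoyodyne default), Lightning's `ModelCheckpoint` can be tied to the evaluation cadence rather than to a best-metric rule:

```python
from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint

# Validate every 500 training steps and write a checkpoint on the same cadence,
# keeping all checkpoints rather than only the best one.
checkpoint = ModelCheckpoint(save_top_k=-1, every_n_train_steps=500)
trainer = Trainer(val_check_interval=500, callbacks=[checkpoint])
```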

kylebgorman commented 2 months ago

> Hmm, the black formatter and the flake8 rules seem to conflict over whether long string lines should be broken. What's your preference?

Black just doesn't know how to break string literals. You have to do it yourself (then call black one more time to make sure it's happy with how you do it).
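
For instance, splitting a long literal into adjacent pieces (implicit concatenation) keeps both tools happy; this is just an illustrative snippet, not code from this PR:

```python
# Adjacent string literals are concatenated at compile time, so the message
# is unchanged but each line stays under the length limit black enforces.
raise ValueError(
    "This error message would exceed the line-length limit as one literal, "
    "so it is split across adjacent string literals instead."
)
```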

Adamits commented 2 months ago

> Also
>
> Another thing we could consider is to make early stopping configurable so that you can stop based on maximizing validation accuracy or minimizing validation loss. That might address some of the same concerns, and it's a research question we might want to consider someday @Adamits.

Yeah, we should definitely do this. It's especially useful when pretraining (where I expect we care about loss, not accuracy). I wonder if we get this already somehow in the Lightning CLI interface? If not now, maybe once we upgrade to 2.0 (somehow I think I was supposed to do this over a year ago :D).
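
A minimal sketch of such a configurable stopping criterion using Lightning's `EarlyStopping` callback (the metric names and patience value here are assumptions, not yoyodyne's actual settings):

```python
from pytorch_lightning.callbacks import EarlyStopping


def make_early_stopping(criterion: str, patience: int = 10) -> EarlyStopping:
    """Stops on maximized validation accuracy or minimized validation loss."""
    if criterion == "accuracy":
        return EarlyStopping(monitor="val_accuracy", mode="max", patience=patience)
    return EarlyStopping(monitor="val_loss", mode="min", patience=patience)
```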

michaelpginn commented 2 months ago

> > Also
> >
> > Another thing we could consider is to make early stopping configurable so that you can stop based on maximizing validation accuracy or minimizing validation loss. That might address some of the same concerns, and it's a research question we might want to consider someday @Adamits.
>
> Yeah, we should definitely do this. It's especially useful when pretraining (where I expect we care about loss, not accuracy). I wonder if we get this already somehow in the Lightning CLI interface? If not now, maybe once we upgrade to 2.0 (somehow I think I was supposed to do this over a year ago :D).

Our project was also interested in using an alternative metric (chrF in our case). I would be happy to explore whether this is something that can be generalized robustly with lightning, if you like!

kylebgorman commented 2 months ago

> > Also
> >
> > Another thing we could consider is to make early stopping configurable so that you can stop based on maximizing validation accuracy or minimizing validation loss. That might address some of the same concerns, and it's a research question we might want to consider someday @Adamits.
>
> Yeah, we should definitely do this. It's especially useful when pretraining (where I expect we care about loss, not accuracy). I wonder if we get this already somehow in the Lightning CLI interface? If not now, maybe once we upgrade to 2.0 (somehow I think I was supposed to do this over a year ago :D).

See #170 for this.

kylebgorman commented 2 months ago

> Our project was also interested in using an alternative metric (chrF in our case). I would be happy to explore whether this is something that can be generalized robustly with lightning, if you like!

What's chrF?

michaelpginn commented 2 months ago

> What's chrF?

Essentially a character-level BLEU score (https://aclanthology.org/W15-3049/). It can potentially help when the data is very limited and accuracy is near 0, but the predictions may contain correct substrings.
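
For reference, chrF is available off the shelf, e.g. in sacrebleu; a quick sketch with toy data (not part of this PR):

```python
import sacrebleu

# Toy predictions and gold targets; real use would read these from dev output.
hypotheses = ["walked", "runing"]
references = [["walked", "running"]]  # one inner list per reference set

# corpus_chrf returns a score object; .score is the chrF value on a 0-100 scale.
print(sacrebleu.corpus_chrf(hypotheses, references).score)
```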

kylebgorman commented 2 months ago

> > What's chrF?
>
> Essentially a character-level BLEU score (https://aclanthology.org/W15-3049/). It can potentially help when the data is very limited and accuracy is near 0, but the predictions may contain correct substrings.

I thought that's what it might mean. We'd welcome a PR to add that.

Adamits commented 2 months ago

> LGTM. @Adamits, shall I merge?

LGTM.