Noble-Lab / casanovo

De Novo Mass Spectrometry Peptide Sequencing with a Transformer Model
https://casanovo.readthedocs.io
Apache License 2.0

Implement early stopping / validation patience interval. #375

Open Lilferrit opened 3 weeks ago

Lilferrit commented 3 weeks ago

This is another QOL feature I implemented for the sake of my own experiments, but that might be nice to add to the mainline Casanovo release. I added a new config option val_patience_interval that defaults to -1 (to mirror the functionality of max_epochs). If val_patience_interval is set to a positive value, an early stopping callback is added to the model runner using PyTorch Lightning's EarlyStopping callback. This callback monitors valid_CELoss and stops model training if valid_CELoss doesn't improve within val_patience_interval.
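To make the intended behavior concrete, here is a minimal pure-Python sketch of the patience logic, not the code on the branch. The names PatienceTracker and stopping_check are hypothetical, and the sketch assumes the loss is compared once per validation check and that a non-positive patience disables early stopping, mirroring the -1 default described above.

```python
from dataclasses import dataclass


@dataclass
class PatienceTracker:
    """Toy stand-in for an early stopping callback (hypothetical, not Casanovo code)."""

    patience: int  # validation checks allowed without improvement; <= 0 disables
    best: float = float("inf")
    stale_checks: int = 0

    def update(self, val_loss: float) -> bool:
        """Record one validation loss and return True if training should stop."""
        if self.patience <= 0:
            # Early stopping is disabled, e.g. with the -1 default.
            return False
        if val_loss < self.best:
            # Improvement: remember the new best loss and reset the counter.
            self.best = val_loss
            self.stale_checks = 0
        else:
            self.stale_checks += 1
        return self.stale_checks >= self.patience


def stopping_check(losses, patience):
    """Return the 1-based validation check at which training stops, or None."""
    tracker = PatienceTracker(patience)
    for check, loss in enumerate(losses, start=1):
        if tracker.update(loss):
            return check
    return None
```

For example, stopping_check([1.0, 0.8, 0.9, 0.85, 0.81, 0.7], patience=3) stops at the fifth check, because the loss has not beaten 0.8 for three checks in a row, while patience=-1 never stops.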

My implementation is on the branch val-early-stop. I also changed the best validation checkpoint filename from <root>.best.ckpt to <root>.<epoch>-<step>.best.ckpt. If we want to add the early stopping feature but don't want to change the best checkpoint filename, I can remove this before submitting a PR.

bittremieux commented 3 weeks ago

> My implementation is on the branch val-early-stop. I also changed the best validation checkpoint filename from <root>.best.ckpt to <root>.<epoch>-<step>.best.ckpt. If we want to add the early stopping feature but don't want to change the best checkpoint filename, I can remove this before submitting a PR.

I don't think that this is an ideal change. The reasoning behind the best.ckpt file was that its filename would always be the same, so that the user can immediately get it. Adding the epoch number removes this advantage.

Adding the early stopping patience is a small change that can make training a bit more convenient, but please make sure that your implementation defines it in terms of the number of training steps, not epochs. When we're training on the full MassIVE-KB data, the model converges even before a full epoch has been processed. This is also why val_check_interval and some other training options are defined in terms of the number of steps.
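As a rough sketch of what a step-based definition could look like: PyTorch Lightning's EarlyStopping counts its patience in validation checks rather than epochs, so a patience given in training steps can be converted using val_check_interval. The helper name patience_in_checks and the parameter patience_steps are hypothetical, and the sketch assumes val_check_interval is an integer number of training steps.

```python
import math


def patience_in_checks(patience_steps: int, val_check_interval: int) -> int:
    """Convert a patience in training steps to a number of validation checks.

    Hypothetical helper: returns -1 (disabled) when patience_steps is not
    positive, otherwise rounds up so that at least patience_steps training
    steps pass without improvement before stopping.
    """
    if val_check_interval <= 0:
        raise ValueError("val_check_interval must be a positive number of steps")
    if patience_steps <= 0:
        return -1
    return max(1, math.ceil(patience_steps / val_check_interval))


# Example: validate every 50,000 steps and tolerate 120,000 steps without
# improvement, which rounds up to 3 validation checks.
checks = patience_in_checks(120_000, 50_000)
```

The resulting value could then be passed as the patience argument of the early stopping callback, which keeps the user-facing option in steps, consistent with val_check_interval.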