ChEB-AI / python-chebai


Add Early Stopping feature #17

Closed VenkateshDas closed 5 months ago

VenkateshDas commented 5 months ago

Implement Early Stopping with PyTorch Lightning

Description

This pull request introduces PyTorch Lightning's Early Stopping feature, which allows a training run to stop before the configured number of epochs is reached. This helps prevent overfitting and can improve model performance.

Changes:

- Adds a new callback configuration, configs/training/early_stop_callbacks.yml, which sets up PyTorch Lightning's EarlyStopping callback.
- Comments out min_epochs in configs/training/default_trainer.yml so that early stopping can take effect (see the note below).

How to Use

Callback Replacement:

Modify your trainer configuration file (configs/training/default_trainer.yml) to replace the existing callbacks value with configs/training/early_stop_callbacks.yml.
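For orientation, a callback config in LightningCLI's class_path/init_args format looks roughly like the sketch below. The exact contents of early_stop_callbacks.yml are the ones in this PR; the import path shown assumes Lightning 2.x (older versions use pytorch_lightning.callbacks.EarlyStopping), and the values mirror the CLI example further down.

```yaml
# Sketch of configs/training/early_stop_callbacks.yml -- a list of callbacks
# in LightningCLI's class_path/init_args format (see the PR for the real file).
- class_path: lightning.pytorch.callbacks.EarlyStopping
  init_args:
    monitor: val_loss_epoch  # logged metric to watch
    mode: min                # stop when the metric stops decreasing
    patience: 5              # validation epochs without improvement before stopping
```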

Command-Line Arguments:

Alternatively, override the Early Stopping arguments directly from the CLI:

```sh
python3 -m chebai fit \
  --trainer=configs/training/default_trainer.yml \
  --trainer.callbacks=configs/training/early_stop_callbacks.yml \
  --trainer.callbacks.init_args.monitor=val_loss_epoch \
  --trainer.callbacks.init_args.patience=5
```
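Here the --trainer.callbacks.init_args.* flags override the corresponding values from early_stop_callbacks.yml: LightningCLI is built on jsonargparse, which merges config files with command-line arguments and lets the command line take precedence.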

*Note: Please make sure that `min_epochs` is not set in the `default_trainer` config when using the early stopping feature. Lightning keeps training until `min_epochs` is reached even after the EarlyStopping callback has signalled a stop, so a `min_epochs` value would delay or prevent early stopping.*

Reference for Early Stopping in PyTorch Lightning: https://lightning.ai/docs/pytorch/stable/common/early_stopping.html
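For comparison, the equivalent setup directly in Python looks like the minimal sketch below, following the Lightning docs linked above (Lightning 2.x import paths; `model` and `datamodule` are placeholders for the actual chebai objects, not names from this repo):

```python
from lightning.pytorch import Trainer
from lightning.pytorch.callbacks import EarlyStopping

# Stop once "val_loss_epoch" has not improved for 5 consecutive validation
# epochs (mirrors the CLI arguments above).
early_stop = EarlyStopping(monitor="val_loss_epoch", mode="min", patience=5)

# Leave min_epochs unset here: Lightning keeps training until min_epochs is
# reached even after EarlyStopping has signalled a stop.
trainer = Trainer(max_epochs=100, callbacks=[early_stop])

# `model` and `datamodule` stand in for the model and data module of the run.
trainer.fit(model, datamodule=datamodule)
```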

sfluegel05 commented 5 months ago

Thanks for the detailed description. Two comments:

1. The file name referenced in the description does not match the one used in the implementation. Could you make them consistent?
2. Have you tested this, i.e. verified that a training run actually stops once the monitored metric stops improving?

VenkateshDas commented 5 months ago

@sfluegel05

1. Apologies for the inconsistency in the file name. I have renamed early_stop_callbacks.yml so that it matches the implementation and the comment.
2. Yes, I tried this implementation in a training run and verified that training stopped when there was no improvement in the "val_loss_epoch" value. I forgot to mention that min_epochs in default_trainer has to be commented out for the early stopping feature to work; I have added that to the description and also noted it in the default trainer config.