Lightning-AI / pytorch-lightning

Pretrain, finetune ANY AI model of ANY size on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

Add trainer flag max_time_per_run #10226

Open ericharper opened 3 years ago

ericharper commented 3 years ago

🚀 Feature

Add a max_time_per_run flag to the Trainer. There is currently a max_time flag: https://pytorch-lightning.readthedocs.io/en/latest/common/trainer.html#max-time , but it limits the global (total) training time, which does not help in this case.
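For illustration, a minimal sketch of how the two flags would differ (note that max_time_per_run does not exist yet; the second call only shows the proposed usage):

```python
from pytorch_lightning import Trainer

# Existing flag: caps global (total) training time, i.e. it keeps counting
# across restarted jobs rather than per job.
trainer = Trainer(max_time="00:04:00:00")  # DD:HH:MM:SS

# Proposed flag (hypothetical, shown only to illustrate the request):
# cap the wall-clock time of the *current* run, independent of any
# previously accumulated training time.
trainer = Trainer(max_time_per_run="00:03:45:00")
```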

Motivation

When training on large GPU clusters with job time limits, it's important to be able to stop training after a specified amount of wall-clock time. For example, assume the cluster imposes a 4-hour time limit per job. If we are training a large model, the job may be killed while a checkpoint is being written to disk, leaving a corrupted checkpoint.

Pitch

If we can configure max_time_per_run, we can ensure that the job terminates gracefully before the cluster's time limit, preventing problems such as corrupted checkpoints mid-training.
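Until such a flag exists, something along these lines would be enough. This is a rough sketch (the callback name and exact hook are illustrative, and it is not the NeMo implementation linked below): it tracks the wall-clock time of the current process and asks the Trainer to stop once the per-run budget is spent.

```python
import time

from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import Callback


class MaxTimePerRunCallback(Callback):
    """Stop training gracefully once this run's wall-clock budget is spent.

    Unlike max_time, the clock starts when the process starts, so the limit
    applies to the current job only, not to cumulative training time.
    """

    def __init__(self, max_seconds: float):
        self.max_seconds = max_seconds
        self.start_time = time.monotonic()

    def on_train_batch_end(self, trainer, pl_module, *args, **kwargs):
        if time.monotonic() - self.start_time > self.max_seconds:
            # Request a graceful stop so the final checkpoint is written
            # before the cluster kills the job.
            trainer.should_stop = True


# e.g. stop after 3h45m on a cluster with a 4-hour job limit
trainer = Trainer(callbacks=[MaxTimePerRunCallback(max_seconds=3.75 * 3600)])
```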

Alternatives

We've implemented our own solution in this PR: https://github.com/NVIDIA/NeMo/pull/3056

However, this seems like a useful feature that anyone using PTL on a cluster with job time limits could benefit from.

Additional context


If you enjoy Lightning, check out our other projects! ⚡

cc @borda @tchaton @justusschock @awaelchli @kaushikb11 @rohitgr7

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, PyTorch Lightning Team!