Lightning-AI / pytorch-lightning

Pretrain, finetune ANY AI model of ANY size on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

Add trainer flag max_time_per_run #10226

Open ericharper opened 3 years ago

ericharper commented 3 years ago

🚀 Feature

Add a max_time_per_run flag to the Trainer. There is currently a max_time flag: https://pytorch-lightning.readthedocs.io/en/latest/common/trainer.html#max-time , but it limits the global (total) training time, which does not help in this case.
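For illustration, a minimal sketch of how the two flags would differ (note that max_time_per_run does not exist yet; the second call only shows the proposed usage):

```python
from pytorch_lightning import Trainer

# Existing flag: caps global (total) training time, i.e. it keeps counting
# across restarted jobs rather than per job.
trainer = Trainer(max_time="00:04:00:00")  # DD:HH:MM:SS

# Proposed flag (hypothetical, shown only to illustrate the request):
# cap the wall-clock time of the *current* run, independent of any
# previously accumulated training time.
trainer = Trainer(max_time_per_run="00:03:45:00")
```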

Motivation

When training on large GPU clusters with job time limits, it's important to be able to stop training after a specified amount of wall-clock time. For example, assume the cluster imposes a 4-hour time limit per job. If we are training a large model, the job may be killed while a checkpoint is being written to disk, leaving a corrupted checkpoint.

Pitch

If we can configure max_time_per_run, we can ensure that the job terminates gracefully before the cluster's time limit, preventing problems such as corrupted checkpoints mid-training.
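Until such a flag exists, something along these lines would be enough. This is a rough sketch (the callback name and exact hook are illustrative, and it is not the NeMo implementation linked below): it tracks the wall-clock time of the current process and asks the Trainer to stop once the per-run budget is spent.

```python
import time

from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import Callback


class MaxTimePerRunCallback(Callback):
    """Stop training gracefully once this run's wall-clock budget is spent.

    Unlike max_time, the clock starts when the process starts, so the limit
    applies to the current job only, not to cumulative training time.
    """

    def __init__(self, max_seconds: float):
        self.max_seconds = max_seconds
        self.start_time = time.monotonic()

    def on_train_batch_end(self, trainer, pl_module, *args, **kwargs):
        if time.monotonic() - self.start_time > self.max_seconds:
            # Request a graceful stop so the final checkpoint is written
            # before the cluster kills the job.
            trainer.should_stop = True


# e.g. stop after 3h45m on a cluster with a 4-hour job limit
trainer = Trainer(callbacks=[MaxTimePerRunCallback(max_seconds=3.75 * 3600)])
```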

Alternatives

We've implemented our own solution in this PR: https://github.com/NVIDIA/NeMo/pull/3056

However, this seems like a useful feature that anyone using PTL on a cluster with job time limits could benefit from.

Additional context


If you enjoy Lightning, check out our other projects! ⚡

cc @borda @tchaton @justusschock @awaelchli @kaushikb11 @rohitgr7

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, PyTorch Lightning Team!