Open · ericharper opened this issue 3 years ago
🚀 Feature
Add a `max_time_per_run` flag to the Trainer. There is currently a `max_time` flag (https://pytorch-lightning.readthedocs.io/en/latest/common/trainer.html#max-time), but it limits the total (global) training time rather than the wall-clock time of the current run, which is not helpful in this case.
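For reference, a minimal sketch of how the existing flag is set today (the string format is DD:HH:MM:SS; a `datetime.timedelta` or dict is also accepted):

```python
import pytorch_lightning as pl

# Existing behaviour: max_time bounds the *total* training time,
# not the wall-clock time of the current job/run.
trainer = pl.Trainer(max_time="00:04:00:00")  # DD:HH:MM:SS
```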
Motivation

When training on large GPU clusters with per-job time limits, it's important to be able to stop training after a specified amount of wall-clock time. For example, assume the cluster enforces a 4-hour time limit on jobs. If we are training a large model, the job may be killed while a checkpoint is being written to disk, resulting in a corrupted checkpoint.
Pitch
If we can configure `max_time_per_run`, we can help ensure that our job terminates more gracefully, preventing problems like corrupted checkpoints during training.
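A minimal sketch of how this could look from the user's side. Note that `max_time_per_run` does not exist in Lightning; the name and accepted format are just the proposal:

```python
import pytorch_lightning as pl

# Hypothetical usage of the proposed flag: stop ~30 minutes before the
# cluster's 4-hour job limit, measured from the start of *this* run, leaving
# time to write a final checkpoint before the scheduler kills the job.
trainer = pl.Trainer(max_time_per_run="00:03:30:00")  # proposed, not a real argument
```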
Alternatives

We've implemented our own solution in this PR: https://github.com/NVIDIA/NeMo/pull/3056

But this seems like a feature that anyone using PTL on a cluster with time limits would benefit from.
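To make the request concrete, here is a minimal sketch of a callback that approximates the behavior today: it measures wall-clock time from the start of the current run and asks the Trainer to stop gracefully. The callback name and hook choices are illustrative, not the implementation from the NeMo PR:

```python
import time

import pytorch_lightning as pl


class PerRunTimeLimit(pl.Callback):
    """Stop training once a wall-clock budget for *this* run is exhausted.

    Nothing is stored in checkpoints, so the budget resets whenever the job
    is (re)launched, unlike ``max_time``, which tracks total training time.
    """

    def __init__(self, max_seconds: float) -> None:
        self.max_seconds = max_seconds
        self._start = None

    def on_train_start(self, trainer, pl_module):
        # Record when this particular run began.
        self._start = time.monotonic()

    def on_train_batch_end(self, trainer, pl_module, *args, **kwargs):
        # Request a graceful stop once the per-run budget is used up, leaving
        # headroom for a final checkpoint before the cluster time limit hits.
        if time.monotonic() - self._start > self.max_seconds:
            trainer.should_stop = True


# Example: budget 3.5 hours of a 4-hour job for training.
trainer = pl.Trainer(callbacks=[PerRunTimeLimit(max_seconds=3.5 * 60 * 60)])
```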
Additional context
cc @borda @tchaton @justusschock @awaelchli @kaushikb11 @rohitgr7