Open · carmocca opened this issue 3 years ago
This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, PyTorch Lightning Team!
Another improvement that would be useful beyond fault tolerance: save the Trainer arguments that were passed. A sketch of the idea follows.
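A rough sketch of what that could look like; the `capture_init_args` helper, the `_init_args` attribute, and the pared-down `Trainer` are illustrative assumptions, not Lightning's actual API:

```python
import inspect

def capture_init_args(frame) -> dict:
    """Collect the named arguments of the calling frame (e.g. an __init__)."""
    args, _, _, values = inspect.getargvalues(frame)
    return {name: values[name] for name in args if name != "self"}

class Trainer:  # illustrative stand-in, not the real pytorch_lightning.Trainer
    def __init__(self, max_epochs: int = 1000, num_nodes: int = 1):
        # Stored so a checkpoint written later can archive how the Trainer was built.
        self._init_args = capture_init_args(inspect.currentframe())
```

With this, `Trainer(max_epochs=10)._init_args` would yield `{'max_epochs': 10, 'num_nodes': 1}`, a dict that could be written into the checkpoint alongside the fault-tolerance metadata.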
🚀 Feature
See title
Motivation
Since we only guarantee fault-tolerant restarts for the same number of GPUs and workers (among other settings), we might want to save metadata about them for archival and error checking.
Pitch
Add extra fields to the checkpoint generated in `_on_exception`:
https://github.com/PyTorchLightning/pytorch-lightning/blob/9d62f248476c6358d8707188f7b20fafa79f8a4f/pytorch_lightning/trainer/trainer.py#L1376-L1381
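To make the idea concrete, here is a minimal sketch of the save/verify round trip; the `fault_tolerant_metadata` checkpoint key and the exact fields are assumptions for illustration, not the actual implementation:

```python
# Minimal sketch: store the settings a fault-tolerant restart must reproduce,
# and validate them when the checkpoint is reloaded. Names are illustrative.
METADATA_KEY = "fault_tolerant_metadata"  # hypothetical checkpoint field

def add_restart_metadata(checkpoint: dict, num_gpus: int, num_workers: int) -> None:
    """Record the settings a fault-tolerant restart must reproduce."""
    checkpoint[METADATA_KEY] = {"num_gpus": num_gpus, "num_workers": num_workers}

def validate_restart_metadata(checkpoint: dict, num_gpus: int, num_workers: int) -> None:
    """Raise if the current run does not match the checkpoint's environment."""
    saved = checkpoint.get(METADATA_KEY)
    if saved is None:
        return  # older checkpoint without metadata: nothing to check
    current = {"num_gpus": num_gpus, "num_workers": num_workers}
    if saved != current:
        raise RuntimeError(
            f"Fault-tolerant restart expected {saved}, but the current run uses {current}."
        )
```

Validating at load time turns a silent mismatch (e.g. resuming on a different GPU count) into an explicit error, which is the "error checking" half of the motivation above.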
Additional context
Originally posted by @carmocca in https://github.com/PyTorchLightning/pytorch-lightning/pull/8515#r677317201
If you enjoy Lightning, check out our other projects! ⚡
Metrics: Machine learning metrics for distributed, scalable PyTorch applications.
Flash: The fastest way to get a Lightning baseline! A collection of tasks for fast prototyping, baselining, fine-tuning, and solving problems with deep learning.
Bolts: Pretrained SOTA deep learning models, callbacks, and more for research and production with PyTorch Lightning and PyTorch.
Lightning Transformers: Flexible interface for high-performance research using SOTA Transformers, leveraging PyTorch Lightning, Transformers, and Hydra.