Lightning-AI / pytorch-lightning

Pretrain, finetune, and deploy AI models on multiple GPUs and TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

Save training metadata with the fault tolerance checkpoint #9123

Open carmocca opened 3 years ago

carmocca commented 3 years ago

🚀 Feature

Save training metadata (e.g. the number of GPUs and dataloader workers) together with the fault-tolerance checkpoint.

Motivation

Since fault-tolerant restarts are only guaranteed for the same number of GPUs and dataloader workers (among other settings), we might want to save metadata about those values for archival purposes and for error checking on restore.

Pitch

Add extra fields to the checkpoint generated in `_on_exception`:

https://github.com/PyTorchLightning/pytorch-lightning/blob/9d62f248476c6358d8707188f7b20fafa79f8a4f/pytorch_lightning/trainer/trainer.py#L1376-L1381
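For illustration, a minimal sketch of what this could look like: it re-opens the auto-saved checkpoint and attaches a metadata dictionary. The function name, the `fault_tolerance_metadata` key, and the exact set of fields are hypothetical; the `Trainer` attributes used are assumed to be available in this version:

```python
import os

import torch


def save_fault_tolerance_checkpoint(trainer) -> None:
    """Hypothetical variant of ``Trainer._on_exception`` that also stores
    restart-relevant metadata inside the auto-saved checkpoint."""
    file_path = os.path.join(trainer.default_root_dir, ".pl_auto_save.ckpt")
    trainer.save_checkpoint(file_path)

    # Re-open the saved checkpoint and attach the configuration values that a
    # fault-tolerant restart must match (the keys below are placeholders).
    checkpoint = torch.load(file_path)
    checkpoint["fault_tolerance_metadata"] = {
        "num_gpus": trainer.num_gpus,
        "num_processes": trainer.num_processes,
        "accumulate_grad_batches": trainer.accumulate_grad_batches,
    }
    torch.save(checkpoint, file_path)
```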

Additional context

This kind of training "metadata" should be saved with the checkpoint. For example, fault tolerance will also need it in order to fail if the trainer configuration has changed between runs and the user is trying to restore mid-batch.

Originally posted by @carmocca in https://github.com/PyTorchLightning/pytorch-lightning/pull/8515#r677317201
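A sketch of that restore-time check, assuming the hypothetical `fault_tolerance_metadata` key from the snippet above:

```python
import torch


def validate_fault_tolerance_metadata(trainer, checkpoint_path: str) -> None:
    """Hypothetical restore-time check: fail early if the trainer configuration
    has changed since the fault-tolerance checkpoint was written."""
    checkpoint = torch.load(checkpoint_path)
    metadata = checkpoint.get("fault_tolerance_metadata")
    if metadata is None:
        # Older checkpoint without metadata: nothing to validate.
        return

    current = {
        "num_gpus": trainer.num_gpus,
        "num_processes": trainer.num_processes,
        "accumulate_grad_batches": trainer.accumulate_grad_batches,
    }
    mismatched = {
        key: (metadata.get(key), value)
        for key, value in current.items()
        if metadata.get(key) != value
    }
    if mismatched:
        raise RuntimeError(
            f"Cannot restore mid-batch: the trainer configuration changed between runs: {mismatched}"
        )
```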



stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions - the PyTorch Lightning Team!

carmocca commented 1 year ago

Another improvement that's not just useful for fault tolerance would be to save the `Trainer` arguments that were passed.
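
A sketch of how those passed arguments could be captured (the helper name is hypothetical; only `inspect.signature` and the public `Trainer` constructor are assumed):

```python
import inspect

from pytorch_lightning import Trainer


def collect_trainer_init_args(init_kwargs: dict) -> dict:
    """Hypothetical helper: keep only the ``Trainer(...)`` arguments that differ
    from the defaults, so they can be stored alongside the checkpoint."""
    signature = inspect.signature(Trainer.__init__)
    defaults = {
        name: param.default
        for name, param in signature.parameters.items()
        if name != "self" and param.default is not inspect.Parameter.empty
    }
    return {
        name: value
        for name, value in init_kwargs.items()
        if name not in defaults or defaults[name] != value
    }
```

For example, `collect_trainer_init_args({"max_epochs": 3, "gpus": 2})` would return only the entries that differ from the `Trainer` defaults, which keeps the stored record compact.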