Lightning-AI / pytorch-lightning

Pretrain, finetune, and deploy AI models on multiple GPUs and TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

Save training metadata with the fault tolerance checkpoint #9123

Open carmocca opened 3 years ago

carmocca commented 3 years ago

🚀 Feature

Save training metadata (e.g. the number of GPUs and dataloader workers) together with the fault-tolerance checkpoint.

Motivation

Since fault-tolerant restarts are only guaranteed for the same number of GPUs and dataloader workers (among other settings), we might want to save metadata about those values for archival purposes and for error checking on restore.

Pitch

Add extra fields to the checkpoint generated in `_on_exception`:

https://github.com/PyTorchLightning/pytorch-lightning/blob/9d62f248476c6358d8707188f7b20fafa79f8a4f/pytorch_lightning/trainer/trainer.py#L1376-L1381
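For illustration, a minimal sketch of what this could look like: it re-opens the auto-saved checkpoint and attaches a metadata dictionary. The function name, the `fault_tolerance_metadata` key, and the exact set of fields are hypothetical; the `Trainer` attributes used are assumed to be available in this version:

```python
import os

import torch


def save_fault_tolerance_checkpoint(trainer) -> None:
    """Hypothetical variant of ``Trainer._on_exception`` that also stores
    restart-relevant metadata inside the auto-saved checkpoint."""
    file_path = os.path.join(trainer.default_root_dir, ".pl_auto_save.ckpt")
    trainer.save_checkpoint(file_path)

    # Re-open the saved checkpoint and attach the configuration values that a
    # fault-tolerant restart must match (the keys below are placeholders).
    checkpoint = torch.load(file_path)
    checkpoint["fault_tolerance_metadata"] = {
        "num_gpus": trainer.num_gpus,
        "num_processes": trainer.num_processes,
        "accumulate_grad_batches": trainer.accumulate_grad_batches,
    }
    torch.save(checkpoint, file_path)
```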

Additional context

This kind of training "metadata" should be saved with the checkpoint. For example, fault tolerance will also need it in order to fail if the trainer configuration has changed between runs and the user is trying to restore mid-batch.

Originally posted by @carmocca in https://github.com/PyTorchLightning/pytorch-lightning/pull/8515#r677317201
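A sketch of that restore-time check, assuming the hypothetical `fault_tolerance_metadata` key from the snippet above:

```python
import torch


def validate_fault_tolerance_metadata(trainer, checkpoint_path: str) -> None:
    """Hypothetical restore-time check: fail early if the trainer configuration
    has changed since the fault-tolerance checkpoint was written."""
    checkpoint = torch.load(checkpoint_path)
    metadata = checkpoint.get("fault_tolerance_metadata")
    if metadata is None:
        # Older checkpoint without metadata: nothing to validate.
        return

    current = {
        "num_gpus": trainer.num_gpus,
        "num_processes": trainer.num_processes,
        "accumulate_grad_batches": trainer.accumulate_grad_batches,
    }
    mismatched = {
        key: (metadata.get(key), value)
        for key, value in current.items()
        if metadata.get(key) != value
    }
    if mismatched:
        raise RuntimeError(
            f"Cannot restore mid-batch: the trainer configuration changed between runs: {mismatched}"
        )
```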



stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions - the PyTorch Lightning Team!

carmocca commented 1 year ago

Another improvement that's not just useful for fault tolerance would be to save the `Trainer` arguments that were passed.
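
A sketch of how those passed arguments could be captured (the helper name is hypothetical; only `inspect.signature` and the public `Trainer` constructor are assumed):

```python
import inspect

from pytorch_lightning import Trainer


def collect_trainer_init_args(init_kwargs: dict) -> dict:
    """Hypothetical helper: keep only the ``Trainer(...)`` arguments that differ
    from the defaults, so they can be stored alongside the checkpoint."""
    signature = inspect.signature(Trainer.__init__)
    defaults = {
        name: param.default
        for name, param in signature.parameters.items()
        if name != "self" and param.default is not inspect.Parameter.empty
    }
    return {
        name: value
        for name, value in init_kwargs.items()
        if name not in defaults or defaults[name] != value
    }
```

For example, `collect_trainer_init_args({"max_epochs": 3, "gpus": 2})` would return only the entries that differ from the `Trainer` defaults, which keeps the stored record compact.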