Closed: awaelchli closed this 1 year ago
You might find this library useful for such primitives, especially to support distributed checkpointing: https://github.com/pytorch/torchsnapshot
@yifuwang
Hey @ananthsub @yifuwang, would you be interested in contributing to Lite?
@ananthsub Thanks, yes, we already saw it, and the interface is really nice. It could also be useful here, called under the hood.
In Lite, we also have the CheckpointIO (attached to the strategies), which takes care of saving and loading, but state dict collection on the objects happens separately. Since torchsnapshot does both, it would have to be integrated differently there.
Please, let's keep the torchsnapshot integration focused on #14503. It's on our roadmap, just waiting for the Lite changes to be over.
🚀 Feature
Add Lite.save_checkpoint and Lite.load_checkpoint convenience methods.
Motivation
It is cumbersome to manually construct a checkpoint dict with all metadata and states.
Pitch
Saving checkpoints:
Loading checkpoints:
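A rough sketch of how such a pair of convenience calls might behave. This is not Lite's implementation: `save_checkpoint`, `load_checkpoint`, and the `Counter` stand-in are hypothetical names, and `pickle` stands in for the serialization backend so the example runs without PyTorch. The only assumption about the objects is that they expose `state_dict()` and `load_state_dict()`, as `torch.nn.Module` and optimizers do.

```python
import os
import pickle
import tempfile


class Counter:
    """Hypothetical stand-in for a model or optimizer (has state_dict/load_state_dict)."""

    def __init__(self, value=0):
        self.value = value

    def state_dict(self):
        return {"value": self.value}

    def load_state_dict(self, state):
        self.value = state["value"]


def save_checkpoint(path, state):
    """Hypothetical helper: swap stateful objects for their state_dict, then serialize."""
    converted = {
        key: obj.state_dict() if hasattr(obj, "state_dict") else obj
        for key, obj in state.items()
    }
    with open(path, "wb") as f:
        pickle.dump(converted, f)


def load_checkpoint(path, state):
    """Hypothetical helper: restore stateful objects in place, return the other entries."""
    with open(path, "rb") as f:
        checkpoint = pickle.load(f)
    remainder = {}
    for key, value in checkpoint.items():
        target = state.get(key)
        if hasattr(target, "load_state_dict"):
            target.load_state_dict(value)
        else:
            remainder[key] = value  # plain metadata such as an epoch counter
    return remainder


path = os.path.join(tempfile.mkdtemp(), "checkpoint.pkl")

# Saving: pass the live objects plus any metadata in one dict.
save_checkpoint(path, {"model": Counter(3), "epoch": 7})

# Loading: pass fresh objects; they are restored in place.
model = Counter()
extra = load_checkpoint(path, {"model": model})
```

The point of the design is that the caller hands over live objects and never touches `state_dict()` directly; metadata that has no matching object comes back as a plain dict.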
Open questions
Alternatives
The current way: constructing the dicts manually and saving/loading them with the self.save/self.load helpers.
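For comparison, the manual approach looks roughly like the sketch below. It uses a toy `Counter` class (hypothetical) in place of a model or optimizer, and `pickle` in place of the self.save/self.load helpers so that it runs without PyTorch. Every entry has to be collected and restored by hand.

```python
import os
import pickle
import tempfile


class Counter:
    """Hypothetical stand-in for a model or optimizer (has state_dict/load_state_dict)."""

    def __init__(self, value=0):
        self.value = value

    def state_dict(self):
        return {"value": self.value}

    def load_state_dict(self, state):
        self.value = state["value"]


model = Counter(3)
optimizer = Counter(10)

# Saving: build the checkpoint dict by hand, one entry per object plus metadata.
checkpoint = {
    "model": model.state_dict(),
    "optimizer": optimizer.state_dict(),
    "epoch": 7,
}
path = os.path.join(tempfile.mkdtemp(), "checkpoint.pkl")
with open(path, "wb") as f:
    pickle.dump(checkpoint, f)  # stand-in for the self.save helper

# Loading: read the dict back and restore each object individually.
new_model = Counter()
new_optimizer = Counter()
with open(path, "rb") as f:
    loaded = pickle.load(f)  # stand-in for the self.load helper
new_model.load_state_dict(loaded["model"])
new_optimizer.load_state_dict(loaded["optimizer"])
epoch = loaded["epoch"]
```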