Lightning-AI / pytorch-lightning

Pretrain, finetune ANY AI model of ANY size on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

Checkpointing primitives for Fabric #14816

Closed awaelchli closed 1 year ago

awaelchli commented 2 years ago

🚀 Feature

Add Lite.save_checkpoint and Lite.load_checkpoint convenience methods.

Motivation

It is cumbersome to manually construct a checkpoint dict with all metadata and states.

Pitch

Saving checkpoints:

ckpt = self.create_checkpoint(model1, model2, ..., optimizer1, optimizer2, ..., key1=value1, key2=value2)
self.save(ckpt, "path/to/ckpt.pt")
  1. Creates a dict and fetches the state dict of every object passed in (identified by an instance check)
  2. Depending on the strategy, consolidates optimizer state, etc.
  3. User-defined metadata can be passed in as keyword arguments
  4. Adds version information
  5. Checkpoint creation and saving are separated so the user can modify the contents if needed
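
A minimal sketch of the saving flow above, not the proposed implementation. It assumes every object exposes a state_dict() method (the protocol PyTorch modules and optimizers follow) and uses pickle in place of the real save helper. The Stateful class, the "states" and "version" keys, and the version string are hypothetical names chosen for illustration. The strategy-dependent consolidation in step 2 is left out.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Any, Dict

# Hypothetical keys the sketch reserves for its own bookkeeping.
RESERVED_KEYS = ("states", "version")


class Stateful:
    """Hypothetical stand-in for a model or optimizer exposing state_dict()."""

    def __init__(self, **state: Any) -> None:
        self._state = dict(state)

    def state_dict(self) -> Dict[str, Any]:
        return dict(self._state)


def create_checkpoint(*objects: Any, **metadata: Any) -> Dict[str, Any]:
    """Collect the state dict of each object, then add metadata and a version."""
    clash = set(metadata) & set(RESERVED_KEYS)
    if clash:
        raise ValueError(f"metadata keys clash with reserved keys: {sorted(clash)}")
    states = []
    for obj in objects:
        # Duck-typed stand-in for the "instance check" in step 1.
        if not callable(getattr(obj, "state_dict", None)):
            raise TypeError(f"{type(obj).__name__} has no state_dict()")
        states.append(obj.state_dict())
    # Step 2 (strategy-dependent consolidation) is intentionally omitted.
    return {"states": states, "version": "0.0-sketch", **metadata}


def save(ckpt: Dict[str, Any], path: Path) -> None:
    """Write the checkpoint with pickle (an assumption, not the real helper)."""
    with open(path, "wb") as f:
        pickle.dump(ckpt, f)


model = Stateful(weight=[1.0, 2.0])
optimizer = Stateful(lr=0.1)
ckpt = create_checkpoint(model, optimizer, epoch=3)
ckpt["note"] = "edited before saving"  # step 5: contents can be modified first

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "ckpt.pt"
    save(ckpt, path)
    with open(path, "rb") as f:
        restored = pickle.load(f)  # read back only to confirm the round trip
```

Storing the states in positional order keeps the sketch short, but it means the same objects must be passed in the same order when loading. A real implementation might key them by name instead.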

Loading checkpoints:

ckpt = self.load("path/to/ckpt.pt")
self.apply_checkpoint(ckpt, model1, model2, ..., optimizer1, optimizer2)

# if you need to access your metadata:
val1 = ckpt["key1"]
  1. User loads the file
  2. User has model and optimizers instantiated
  3. Applies the checkpoint to the objects: the state dicts it contains get loaded into the model, optimizers, etc.
  4. Loading the checkpoint from file and applying it to the objects are separated so the user can modify the contents before they get loaded
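
The loading flow above could look roughly like the following sketch. It is an illustration under assumptions, not the proposed implementation: objects are assumed to expose load_state_dict(), the file is read with pickle, and the Stateful class and the "states" key are hypothetical names. States are matched to objects by position.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Any, Dict


class Stateful:
    """Hypothetical stand-in for a model or optimizer."""

    def __init__(self, **state: Any) -> None:
        self._state = dict(state)

    def state_dict(self) -> Dict[str, Any]:
        return dict(self._state)

    def load_state_dict(self, state: Dict[str, Any]) -> None:
        self._state = dict(state)


def load(path: Path) -> Dict[str, Any]:
    """Read a checkpoint with pickle (only safe for files you trust)."""
    with open(path, "rb") as f:
        return pickle.load(f)


def apply_checkpoint(ckpt: Dict[str, Any], *objects: Any) -> None:
    """Load the saved state dicts into the objects, matched by position."""
    states = ckpt["states"]
    if len(states) != len(objects):
        raise ValueError(f"checkpoint holds {len(states)} states, got {len(objects)} objects")
    for obj, state in zip(objects, states):
        obj.load_state_dict(state)


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "ckpt.pt"
    # Write a small checkpoint file so the sketch is self-contained.
    with open(path, "wb") as f:
        pickle.dump({"states": [{"weight": [1.0, 2.0]}, {"lr": 0.1}], "key1": "value1"}, f)
    ckpt = load(path)  # step 1: the user loads the file

ckpt["states"][1]["lr"] = 0.01  # step 4: modify contents before applying

model = Stateful(weight=[0.0, 0.0])  # step 2: objects already instantiated
optimizer = Stateful(lr=1.0)
apply_checkpoint(ckpt, model, optimizer)  # step 3: states applied to objects

val1 = ckpt["key1"]  # metadata stays accessible on the loaded dict
```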

Open questions

Alternatives

The current way: constructing the dicts manually and saving/loading them with the self.save/self.load helpers.



ananthsub commented 2 years ago

You might find this library useful for such primitives, especially to support distributed checkpointing: https://github.com/pytorch/torchsnapshot

@yifuwang

tchaton commented 2 years ago

Hey @ananthsub @yifuwang, would you be interested in making a contribution to Lite?

awaelchli commented 2 years ago

@ananthsub Thanks, yes, we already saw it and the interface is really nice. It could be useful here too, called under the hood.

In Lite, we also have the CheckpointIO (attached to the strategies) which takes care of the saving and loading, but state dict collection on the objects happens separately. Since torchsnapshot does both, it would have to be integrated differently there.

carmocca commented 2 years ago

Please, let's keep the torchsnapshot integration discussion in #14503. It's on our roadmap, just waiting for the Lite changes to be finished.