Lightning-AI / pytorch-lightning

Pretrain, finetune ANY AI model of ANY size on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0
28.17k stars 3.37k forks

Integrate torchsnapshot #14503

Open carmocca opened 2 years ago

carmocca commented 2 years ago

🚀 Feature

Integrate https://github.com/pytorch/torchsnapshot

Motivation

The library is designed with composition in mind and is very modular. The distributed training benchmarks look very promising, so this would be a great addition to the project.

From their maintainers:

Despite the experimental status, we’ve already been taking backward compatibility very seriously, as there are already some early adopters from external companies. In terms of API, the surface area is very small and we do not have any plans for BC-breaking changes. In terms of storage format, we are already committed to being backward compatible. FWIW, the project will go to beta stage late September or early October.

In the future, it could also include other features such as snapshotting the DataLoader state (both for V1 and V2 DataLoaders)

More resources:

Pitch

At this point, it looks like a SnapshotCheckpointIO plugin would be the right mechanism to do it.
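A minimal sketch of what such a plugin's surface could look like. The class name `SnapshotCheckpointIO` comes from this issue; the `save_checkpoint`/`load_checkpoint`/`remove_checkpoint` method names mirror Lightning's `CheckpointIO` plugin contract. To keep the sketch self-contained, stdlib `pickle` stands in for the actual torchsnapshot calls (`torchsnapshot.Snapshot.take(path=..., app_state=...)` to write and `torchsnapshot.Snapshot(path).restore(app_state=...)` to read), which a real integration would use instead:

```python
import os
import pickle


class SnapshotCheckpointIO:
    """Hypothetical sketch following Lightning's CheckpointIO contract.

    In a real integration, the pickle calls below would be replaced by
    torchsnapshot's Snapshot.take() / Snapshot(path).restore(), which
    write and read a sharded snapshot directory in parallel rather than
    a single serialized file.
    """

    def save_checkpoint(self, checkpoint: dict, path: str) -> None:
        # Stand-in for: torchsnapshot.Snapshot.take(path=path, app_state=...)
        with open(path, "wb") as f:
            pickle.dump(checkpoint, f)

    def load_checkpoint(self, path: str) -> dict:
        # Stand-in for: torchsnapshot.Snapshot(path).restore(app_state=...)
        with open(path, "rb") as f:
            return pickle.load(f)

    def remove_checkpoint(self, path: str) -> None:
        # Delete the checkpoint artifact if it exists.
        if os.path.exists(path):
            os.remove(path)
```

The plugin would then be passed to the `Trainer` (e.g. `Trainer(plugins=[SnapshotCheckpointIO()])`), so all checkpoint saving and loading routes through torchsnapshot with no other user-facing changes.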

Alternatives

Not do it.


If you enjoy Lightning, check out our other projects! ⚡

cc @borda @awaelchli @ananthsub @ninginthecloud @rohitgr7 @otaj @akihironitta

awaelchli commented 2 years ago

This is great. A checkpoint io plugin would be the best way, I agree.