Lightning-AI / pytorch-lightning


Can sharded be resumed with a different number of devices? #14485

Closed · carmocca closed this issue 2 years ago

carmocca commented 2 years ago

šŸš€ Feature

Does this work?

import os

from pytorch_lightning import Trainer
from pytorch_lightning.demos.boring_classes import BoringModel

# Train briefly with the sharded strategy on 2 GPUs and save a checkpoint.
model = BoringModel()
trainer = Trainer(strategy="ddp_sharded_spawn", fast_dev_run=True, gpus=2)
trainer.fit(model)

checkpoint_path = os.path.join(tmpdir, "model.pt")  # tmpdir: any temporary directory, e.g. pytest's tmpdir fixture
trainer.save_checkpoint(checkpoint_path)

# Resume from that checkpoint with a different number of devices.
model = BoringModel()
trainer = Trainer(strategy="ddp_sharded_spawn", fast_dev_run=True, gpus=1)
trainer.fit(model, ckpt_path=checkpoint_path)

Motivation

We have a legacy test that is skipped in our CI with a message saying this is unsupported.

Pitch

We should verify whether this works and, if it is not supported, raise an appropriate error.
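
For illustration, a rough sketch of what such a check could look like, assuming (hypothetically) that the sharded checkpoint records the world size it was saved with; the "world_size" key and the check_resumable helper are made up for this example and are not Lightning's actual API:

from pytorch_lightning.utilities.exceptions import MisconfigurationException

def check_resumable(checkpoint: dict, current_world_size: int) -> None:
    # Hypothetical: assumes the checkpoint stores the number of processes it
    # was saved with under a "world_size" key.
    saved_world_size = checkpoint.get("world_size")
    if saved_world_size is not None and saved_world_size != current_world_size:
        raise MisconfigurationException(
            f"This sharded checkpoint was saved with {saved_world_size} processes, "
            f"but training is being resumed with {current_world_size}. "
            "Resuming a sharded checkpoint with a different number of devices is not supported."
        )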

Additional context

https://github.com/Lightning-AI/lightning/pull/14476/files/38e10ba837dad423ecaa52f300609083de379e19#r960749044


cc @tchaton @rohitgr7 @borda @akihironitta @awaelchli

awaelchli commented 2 years ago

@carmocca It should, because the optimizer state gets consolidated from all ranks before saving. https://github.com/Lightning-AI/lightning/blob/e0c2c3e677d141594cdd799050942b10908c9a97/src/pytorch_lightning/strategies/strategy.py#L176-L183
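
For context, a minimal sketch of what that consolidation step looks like with a fairscale OSS optimizer (assumed fairscale API, not the exact code behind the link above):

from fairscale.optim import OSS

def consolidated_optimizer_state(optimizer: OSS) -> dict:
    # Each rank only holds its own shard of the optimizer state, so the full
    # state is first gathered onto a single rank.
    optimizer.consolidate_state_dict(recipient_rank=0)
    # On the recipient rank this returns the complete, unsharded state dict,
    # which is why the saved checkpoint is not tied to the original device count.
    return optimizer.state_dict()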

Adding a test (if we haven't already) can't hurt though.