Lightning-AI / pytorch-lightning


Can sharded be resumed with a different number of devices? #14485

Closed · carmocca closed this issue 2 years ago

carmocca commented 2 years ago

šŸš€ Feature

Does this work?

import os

from pytorch_lightning import Trainer
from pytorch_lightning.demos.boring_classes import BoringModel

# Train briefly with the sharded strategy on 2 GPUs and save a checkpoint.
model = BoringModel()
trainer = Trainer(strategy="ddp_sharded_spawn", fast_dev_run=True, gpus=2)
trainer.fit(model)

checkpoint_path = os.path.join(tmpdir, "model.pt")  # tmpdir: any temporary directory, e.g. pytest's tmpdir fixture
trainer.save_checkpoint(checkpoint_path)

# Resume from that checkpoint with a different number of devices.
model = BoringModel()
trainer = Trainer(strategy="ddp_sharded_spawn", fast_dev_run=True, gpus=1)
trainer.fit(model, ckpt_path=checkpoint_path)

Motivation

We have a legacy test that is skipped in our CI with a message saying this is unsupported.

Pitch

We should verify whether this works and, if it is not supported, raise an appropriate error.
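
For illustration, a rough sketch of what such a check could look like, assuming (hypothetically) that the sharded checkpoint records the world size it was saved with; the "world_size" key and the check_resumable helper are made up for this example and are not Lightning's actual API:

from pytorch_lightning.utilities.exceptions import MisconfigurationException

def check_resumable(checkpoint: dict, current_world_size: int) -> None:
    # Hypothetical: assumes the checkpoint stores the number of processes it
    # was saved with under a "world_size" key.
    saved_world_size = checkpoint.get("world_size")
    if saved_world_size is not None and saved_world_size != current_world_size:
        raise MisconfigurationException(
            f"This sharded checkpoint was saved with {saved_world_size} processes, "
            f"but training is being resumed with {current_world_size}. "
            "Resuming a sharded checkpoint with a different number of devices is not supported."
        )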

Additional context

https://github.com/Lightning-AI/lightning/pull/14476/files/38e10ba837dad423ecaa52f300609083de379e19#r960749044


cc @tchaton @rohitgr7 @borda @akihironitta @awaelchli

awaelchli commented 2 years ago

@carmocca It should, because the optimizer state gets consolidated from all ranks before saving. https://github.com/Lightning-AI/lightning/blob/e0c2c3e677d141594cdd799050942b10908c9a97/src/pytorch_lightning/strategies/strategy.py#L176-L183
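
For context, a minimal sketch of what that consolidation step looks like with a fairscale OSS optimizer (assumed fairscale API, not the exact code behind the link above):

from fairscale.optim import OSS

def consolidated_optimizer_state(optimizer: OSS) -> dict:
    # Each rank only holds its own shard of the optimizer state, so the full
    # state is first gathered onto a single rank.
    optimizer.consolidate_state_dict(recipient_rank=0)
    # On the recipient rank this returns the complete, unsharded state dict,
    # which is why the saved checkpoint is not tied to the original device count.
    return optimizer.state_dict()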

Adding a test (if we haven't already) can't hurt though.