Lightning-AI / pytorch-lightning

Pretrain, finetune ANY AI model of ANY size on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

RFC: Make subprocess DDP the default when selecting multiple devices #14075

Closed: awaelchli closed this issue 1 year ago

awaelchli commented 2 years ago

🚀 Feature

RFC: Switch to strategy="ddp" when selecting devices>1.

I'm putting this RFC out so it can be discussed. There is no urgency here, and we don't yet know what users prefer. If there is not enough interest, the issue can simply be closed.

Motivation

strategy="ddp_spawn" is the default strategy selected by the Trainer when setting Trainer(devices>1, accelerator="cpu"|"gpu"). DDP Spawn has two main limitations:

  1. Objects attached to the model, trainer, callbacks, etc. need to be pickle-able (#14198).
  2. When the processes join and program flow returns to the main process, most state that changed in the worker(s) is lost. Only the model weights, metrics, and some attributes on the model checkpoint callback are restored, manually (a minimal sketch follows).
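Both limitations fall out of how spawn-based multiprocessing works in Python generally, not anything Lightning-specific. A minimal sketch with plain `multiprocessing` (the `payload` dict is just an illustration):

```python
import multiprocessing as mp


def worker(rank, payload):
    # `payload` arrives as an unpickled *copy*; mutations made here
    # never propagate back to the parent process.
    payload["trained"] = True


if __name__ == "__main__":
    ctx = mp.get_context("spawn")
    payload = {"trained": False}

    p = ctx.Process(target=worker, args=(0, payload))
    p.start()
    p.join()

    # Limitation 2: state changed in the worker is lost in the parent.
    print(payload["trained"])  # -> False

    # Limitation 1: everything passed to a spawned process is pickled,
    # so unpicklable objects (lambdas, open handles, ...) fail:
    # ctx.Process(target=worker, args=(0, lambda x: x)).start()  # PicklingError
```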

Regular DDP based on launching subprocesses doesn't have these issues. Historically, ddp_spawn was chosen as the default because it also worked in Jupyter notebooks, but that was before PyTorch switched the default start method. We now have different solutions for running ddp in Jupyter (#13405).
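For reference, a sketch of what the interactive-environment escape hatch looks like, assuming the 1.x `pytorch_lightning` import path and the `ddp_notebook` strategy name:

```python
from pytorch_lightning import Trainer

# Inside a Jupyter notebook, fork-based DDP can be requested explicitly
# instead of relying on spawn being the global default:
trainer = Trainer(accelerator="gpu", devices=2, strategy="ddp_notebook")
```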

Pitch

Swap the default from ddp_spawn to ddp.

This would be a breaking change and should be considered for 2.0. There is also the possibility that people prefer to run ddp_spawn anyway, in which case we shouldn't make the switch.
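Concretely, the user-visible change would be roughly this (assuming the 1.x `Trainer` API; the resolved strategy names are illustrative):

```python
from pytorch_lightning import Trainer

# Today: implicit multi-device selection resolves to spawn-based DDP.
Trainer(accelerator="gpu", devices=2)  # strategy -> "ddp_spawn"

# After the swap: the same call would resolve to subprocess DDP.
Trainer(accelerator="gpu", devices=2)  # strategy -> "ddp"

# Users who prefer the old behavior could still opt in explicitly.
Trainer(accelerator="gpu", devices=2, strategy="ddp_spawn")
```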

Pros:

Cons:

Alternatives

Keep it as is.

Additional context


If you enjoy Lightning, check out our other projects! ⚡

cc @borda @justusschock @kaushikb11 @awaelchli @akihironitta @rohitgr7

rohitgr7 commented 2 years ago

agreed!

justusschock commented 2 years ago

ddp-spawn works virtually everywhere (except for interactive envs), right? Is there any case where subprocess ddp doesn't work? Because IIRC we chose spawn due to its compatibility with almost every env, and I think this should still be our main concern. We should aim for something that "just works" in 99% of the cases but can of course be tuned further with more in-depth knowledge of the strategies and the specific environment.

awaelchli commented 2 years ago

> ddp-spawn works virtually everywhere (except for interactive envs), right?

Yes, and so does ddp, afaik, because all it requires is that the script can launch itself as `python train.py ...`. I assume that's the case almost everywhere, though it's possible there are other exotic ways of launching a Python program that I don't know of :)
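For context, the subprocess launcher boils down to the script re-invoking itself once per additional device, with the torch.distributed rendezvous variables set in the environment. A rough sketch (not Lightning's actual launcher; names and ports are illustrative):

```python
import os
import subprocess
import sys


def launch_subprocess_ddp(num_devices: int) -> None:
    """Re-launch the running script once per extra local device."""
    for local_rank in range(1, num_devices):
        env = os.environ.copy()
        # Standard torch.distributed rendezvous variables.
        env["MASTER_ADDR"] = "127.0.0.1"
        env["MASTER_PORT"] = "29500"
        env["WORLD_SIZE"] = str(num_devices)
        env["RANK"] = str(local_rank)
        env["LOCAL_RANK"] = str(local_rank)
        # The only real requirement: `python train.py ...` must work,
        # i.e. the script can launch itself with its own argv.
        subprocess.Popen([sys.executable] + sys.argv, env=env)
    # The current process keeps running as rank 0 on the same code path.
```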

awaelchli commented 1 year ago

Resolved in #16780