Lightning-AI / pytorch-lightning

Pretrain, finetune ANY AI model of ANY size on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

RFC: Make subprocess DDP the default when selecting multiple devices #14075

Closed: awaelchli closed this issue 1 year ago

awaelchli commented 2 years ago

🚀 Feature

RFC: Switch to strategy="ddp" when selecting devices>1.

I'm putting this RFC out so it can be discussed. There is no urgency here, and we don't yet know what users prefer. If there is not enough interest, the issue can simply be closed.

Motivation

strategy="ddp_spawn" is the default strategy selected by the Trainer when setting Trainer(devices>1, accelerator="cpu"|"gpu"). DDP Spawn has two main limitations:

  1. Objects attached to the model, trainer, callbacks, etc. need to be pickle-able (#14198).
  2. When the processes join and program flow returns to the main process, most state that changed in the worker(s) is lost. Only the model weights, metrics, and some attributes on the model checkpoint callback are restored, manually (a minimal sketch follows).
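Both limitations fall out of how spawn-based multiprocessing works in Python generally, not anything Lightning-specific. A minimal sketch with plain `multiprocessing` (the `payload` dict is just an illustration):

```python
import multiprocessing as mp


def worker(rank, payload):
    # `payload` arrives as an unpickled *copy*; mutations made here
    # never propagate back to the parent process.
    payload["trained"] = True


if __name__ == "__main__":
    ctx = mp.get_context("spawn")
    payload = {"trained": False}

    p = ctx.Process(target=worker, args=(0, payload))
    p.start()
    p.join()

    # Limitation 2: state changed in the worker is lost in the parent.
    print(payload["trained"])  # -> False

    # Limitation 1: everything passed to a spawned process is pickled,
    # so unpicklable objects (lambdas, open handles, ...) fail:
    # ctx.Process(target=worker, args=(0, lambda x: x)).start()  # PicklingError
```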

Regular DDP based on launching subprocesses doesn't have these issues. Historically, ddp_spawn was chosen as the default because it also worked in Jupyter notebooks, but that was before PyTorch switched the default start method. We now have different solutions for running ddp in Jupyter (#13405).
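For reference, a sketch of what the interactive-environment escape hatch looks like, assuming the 1.x `pytorch_lightning` import path and the `ddp_notebook` strategy name:

```python
from pytorch_lightning import Trainer

# Inside a Jupyter notebook, fork-based DDP can be requested explicitly
# instead of relying on spawn being the global default:
trainer = Trainer(accelerator="gpu", devices=2, strategy="ddp_notebook")
```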

Pitch

Swap the default from ddp_spawn to ddp.

This would be a breaking change and should be considered for 2.0. There is also the possibility that people prefer to run ddp_spawn anyway, in which case we shouldn't make the switch.
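Concretely, the user-visible change would be roughly this (assuming the 1.x `Trainer` API; the resolved strategy names are illustrative):

```python
from pytorch_lightning import Trainer

# Today: implicit multi-device selection resolves to spawn-based DDP.
Trainer(accelerator="gpu", devices=2)  # strategy -> "ddp_spawn"

# After the swap: the same call would resolve to subprocess DDP.
Trainer(accelerator="gpu", devices=2)  # strategy -> "ddp"

# Users who prefer the old behavior could still opt in explicitly.
Trainer(accelerator="gpu", devices=2, strategy="ddp_spawn")
```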

Pros:

Cons:

Alternatives

Keep it as is.

Additional context


If you enjoy Lightning, check out our other projects! ⚡

cc @borda @justusschock @kaushikb11 @awaelchli @akihironitta @rohitgr7

rohitgr7 commented 2 years ago

agreed!

justusschock commented 2 years ago

ddp-spawn works virtually everywhere (except for interactive envs), right? Is there any case where subprocess ddp doesn't work? Because IIRC we chose spawn due to its compatibility with almost every env, and I think this should still be our main concern. We should aim for something that "just works" in 99% of the cases but can of course be tuned further with more in-depth knowledge of the strategies and the specific environment.

awaelchli commented 2 years ago

> ddp-spawn works virtually everywhere (except for interactive envs), right?

Yes, and so does ddp, afaik, because all it requires is that the script can launch itself as `python train.py ...`. I assume that's the case almost everywhere, though it's possible there are other exotic ways of launching a Python program that I don't know of :)
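For context, the subprocess launcher boils down to the script re-invoking itself once per additional device, with the torch.distributed rendezvous variables set in the environment. A rough sketch (not Lightning's actual launcher; names and ports are illustrative):

```python
import os
import subprocess
import sys


def launch_subprocess_ddp(num_devices: int) -> None:
    """Re-launch the running script once per extra local device."""
    for local_rank in range(1, num_devices):
        env = os.environ.copy()
        # Standard torch.distributed rendezvous variables.
        env["MASTER_ADDR"] = "127.0.0.1"
        env["MASTER_PORT"] = "29500"
        env["WORLD_SIZE"] = str(num_devices)
        env["RANK"] = str(local_rank)
        env["LOCAL_RANK"] = str(local_rank)
        # The only real requirement: `python train.py ...` must work,
        # i.e. the script can launch itself with its own argv.
        subprocess.Popen([sys.executable] + sys.argv, env=env)
    # The current process keeps running as rank 0 on the same code path.
```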

awaelchli commented 1 year ago

Resolved in #16780