agreed!
ddp-spawn works virtually everywhere (except for interactive envs), right? Is there any case where subprocess ddp doesn't work? Because IIRC we chose spawn due to its compatibility with almost every env, and I think this should still be our main concern. We should aim for something that "just works" in 99% of the cases but can of course be tuned further with more in-depth knowledge of the strategies and the specific environment.
ddp-spawn works virtually everywhere (except for interactive envs), right?
Yes, and so does ddp afaik, because all it requires is that the script can launch itself as "python train.py ...". Which I assume would be the case, but it is possible that there are other exotic ways of launching a Python program that I don't know of :)
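To make that requirement concrete, here is a minimal sketch (not from this thread) of a script that subprocess-based ddp can re-launch as "python train.py". The BoringModel class, the random data, and the specific Trainer arguments are illustrative placeholders, not the code under discussion; the key point is that training only starts under the __main__ guard, so re-importing the file in a launched copy does not start training again.

```python
# Illustrative sketch: a self-contained script that strategy="ddp" can re-launch.
import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class BoringModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


if __name__ == "__main__":
    dataset = TensorDataset(torch.randn(64, 32), torch.randn(64, 2))
    train_loader = DataLoader(dataset, batch_size=8)
    # strategy="ddp" launches additional copies of this script, one per device
    trainer = pl.Trainer(accelerator="cpu", devices=2, strategy="ddp", max_epochs=1)
    trainer.fit(BoringModel(), train_loader)
```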
Resolved in #16780
🚀 Feature
RFC: Switch to strategy="ddp" when selecting devices>1.
I'm putting this RFC out so it can be discussed. There is no urgency to this, and we don't know what the preferences of the users are. If there is not enough interest, the issue will be closed.
Motivation
accelerator="ddp_spawn"
is the default strategy selected by the Trainer when settingTrainer(devices>1, accelerator="cpu"|"gpu)
. DDP Spawn has two main limitations:Regular DDP based on launching subprocesses doesn't have these issues. Historically, ddp_spawn was chosen as the default because it also worked in Jupyter notebooks, but that was before PyTorch switched the default start method. We now have different solutions for running ddp in Jupyter (#13405).
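For reference, a short sketch of what this default selection looks like from the user's side. Argument names follow recent Trainer versions; the exact resolution may differ between releases, so treat this as illustrative rather than authoritative.

```python
from pytorch_lightning import Trainer

# Today: multiple devices without an explicit strategy resolve to DDP Spawn.
trainer = Trainer(accelerator="gpu", devices=2)  # currently picks "ddp_spawn"

# Explicit form, and what this RFC proposes as the new default:
trainer = Trainer(accelerator="gpu", devices=2, strategy="ddp")
```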
Pitch
Swap the default from ddp_spawn to ddp.
This would be a breaking change and should be considered for 2.0. There is also the possibility that people prefer to run ddp_spawn anyway, in which case we shouldn't make the switch.
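If the default did change, users who prefer the spawn-based launcher could still opt in explicitly, as in this sketch (using the existing strategy name):

```python
from pytorch_lightning import Trainer

# Keep the spawn-based launcher explicitly after the proposed switch.
trainer = Trainer(accelerator="gpu", devices=2, strategy="ddp_spawn")
```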
Pros:
Cons:
Alternatives
Keep it as is.
Additional context
If you enjoy Lightning, check out our other projects! ⚡
Metrics: Machine learning metrics for distributed, scalable PyTorch applications.
Lite: enables pure PyTorch users to scale their existing code on any kind of device while retaining full control over their own loops and optimization logic.
Flash: The fastest way to get a Lightning baseline! A collection of tasks for fast prototyping, baselining, fine-tuning, and solving problems with deep learning.
Bolts: Pretrained SOTA Deep Learning models, callbacks, and more for research and production with PyTorch Lightning and PyTorch.
Lightning Transformers: Flexible interface for high-performance research using SOTA Transformers leveraging PyTorch Lightning, Transformers, and Hydra.
cc @borda @justusschock @kaushikb11 @awaelchli @akihironitta @rohitgr7