Open awaelchli opened 2 years ago
cc @otaj who was working on Join I think :)
Is this proposal relevant to Join only, or should we instead tackle it from the wider perspective of https://github.com/Lightning-AI/lightning/issues/7534?
Join would be DDP only, but the idea that "they won't have to change it again when switching to a single-device strategy (it would simply be a no-op)" should apply to all collective calls.
Oh, yes, this is a great idea! However, I gotta agree with @carmocca that it might be better to have this applied to all collective calls. Maybe even have something like a lightning.lite.distributed module/package which would contain all of these calls.
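To make that idea concrete, here is a minimal sketch of what one backend-agnostic collective wrapper in such a module could look like; the module layout and function are hypothetical, nothing like this exists yet:

```python
from typing import List

import torch
import torch.distributed as dist


def all_gather(tensor: torch.Tensor) -> List[torch.Tensor]:
    """Gather a tensor from every rank; a no-op on a single device."""
    # Without an initialized process group (single-device case) there is
    # nothing to gather, so just return the local tensor.
    if not dist.is_available() or not dist.is_initialized():
        return [tensor]
    gathered = [torch.empty_like(tensor) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, tensor)
    return gathered
```

The same pattern (check for an initialized process group, otherwise fall back to a trivial implementation) would apply to broadcast, all_reduce, barrier, and to Join itself.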
Related: https://github.com/Lightning-AI/lightning/issues/13821 (DeepSpeed did something similar).
🚀 Feature
Provide Join through an intuitive API in LightningLite and make it backend agnostic, i.e., switching from DDP to single-device and vice versa should not require changes to the code.
Motivation
The DDP Join context manager in PyTorch allows you to run your loops with a different number of items on each rank without running into desynchronization issues and hangs in collective calls. PyTorch calls this "uneven inputs". Normally, the DistributedSampler would "even out" the data on each rank by inserting fake, repeated data.
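For context, this is roughly what using the raw PyTorch API looks like today; a minimal sketch with a toy model and the gloo backend, where only the DDP-wrapped module participates in the join:

```python
import torch
import torch.distributed as dist
from torch.distributed.algorithms import Join
from torch.nn.parallel import DistributedDataParallel as DDP


def run(rank: int, world_size: int):
    # Standard process-group setup; assumes MASTER_ADDR/MASTER_PORT are set.
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    model = DDP(torch.nn.Linear(8, 2))
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    # Deliberately uneven inputs: rank 0 gets 5 batches, rank 1 gets 6, ...
    batches = [torch.randn(4, 8) for _ in range(5 + rank)]

    # Join shadows the collective calls of ranks that run out of data early,
    # so the ranks that still have batches left do not hang.
    with Join([model]):
        for batch in batches:
            loss = model(batch).sum()
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

    dist.destroy_process_group()
```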
Pitch
Provide Join in LightningLite, more specifically through the Strategy. The idea here is that once the user has added the join to their loop, they won't have to change it again when switching to a single-device strategy (it would simply be a no-op).
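A rough sketch of how this could look from the user's side; the self.join hook is purely hypothetical and only illustrates the proposal, while the rest follows the existing LightningLite API:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from pytorch_lightning.lite import LightningLite


class Loop(LightningLite):
    def run(self):
        module = torch.nn.Linear(32, 2)
        optimizer = torch.optim.SGD(module.parameters(), lr=0.1)
        model, optimizer = self.setup(module, optimizer)
        dataloader = self.setup_dataloaders(
            DataLoader(TensorDataset(torch.randn(100, 32)), batch_size=8)
        )

        # Hypothetical API: with a DDP strategy this would delegate to
        # torch.distributed.algorithms.Join; with a single-device strategy
        # it would return a no-op context manager, so the loop stays unchanged.
        with self.join(model):
            for (batch,) in dataloader:
                loss = model(batch).sum()
                self.backward(loss)
                optimizer.step()
                optimizer.zero_grad()


Loop(accelerator="cpu", devices=2, strategy="ddp").run()
```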
Note that, by default, Lite will auto-insert a DistributedSampler into the dataloader for the user. The tricky part here is that Join is only useful if you set drop_last=False in the sampler. How do we link the two features together so that they work in a meaningful way?
Alternatives
Do not introduce this. The user can just use the raw PyTorch APIs.
Additional context
Once this lands in Lite, the PL strategies can also make use of it in their implementations. This can be developed in parallel to #3325.
If you enjoy Lightning, check out our other projects! ⚡
Metrics: Machine learning metrics for distributed, scalable PyTorch applications.
Lite: Enables pure PyTorch users to scale their existing code on any kind of device while retaining full control over their own loops and optimization logic.
Flash: The fastest way to get a Lightning baseline! A collection of tasks for fast prototyping, baselining, fine-tuning, and solving problems with deep learning.
Bolts: Pretrained SOTA Deep Learning models, callbacks, and more for research and production with PyTorch Lightning and PyTorch.
Lightning Transformers: Flexible interface for high-performance research using SOTA Transformers leveraging PyTorch Lightning, Transformers, and Hydra.
cc @borda @carmocca @justusschock @awaelchli