Open awaelchli opened 2 years ago
cc @otaj who was working on Join I think :)
Is this proposal relevant to Join only, or should we instead tackle it from the wider perspective of https://github.com/Lightning-AI/lightning/issues/7534?
Join would be DDP only, but the idea that "they won't have to change it again when switching to a single-device strategy (it would simply be a no-op)" should apply to all collective calls.
Oh, yes, this is a great idea! However, I gotta agree with @carmocca that it might be better to have this applied to all collective calls. Maybe even have something like a lightning.lite.distributed module/package which would contain all of these calls.
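To make that idea concrete, here is a minimal sketch of what one backend-agnostic collective wrapper in such a module could look like; the module layout and function are hypothetical, nothing like this exists yet:

```python
from typing import List

import torch
import torch.distributed as dist


def all_gather(tensor: torch.Tensor) -> List[torch.Tensor]:
    """Gather a tensor from every rank; a no-op on a single device."""
    # Without an initialized process group (single-device case) there is
    # nothing to gather, so just return the local tensor.
    if not dist.is_available() or not dist.is_initialized():
        return [tensor]
    gathered = [torch.empty_like(tensor) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, tensor)
    return gathered
```

The same pattern (check for an initialized process group, otherwise fall back to a trivial implementation) would apply to broadcast, all_reduce, barrier, and to Join itself.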
Related: https://github.com/Lightning-AI/lightning/issues/13821 (DeepSpeed did something similar).
🚀 Feature
Provide Join through an intuitive API in LightningLite and make it backend agnostic, i.e., switching from DDP to single-device and vice versa should not require changes to the code.
Motivation
The DDP Join context manager in PyTorch allows you to run your loops with a different number of items on each rank without running into desynchronization issues and hangs in collective calls. PyTorch calls this "uneven inputs". Normally, the DistributedSampler would "even out" the data on each rank by inserting fake, repeated data.
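For context, this is roughly what using the raw PyTorch API looks like today; a minimal sketch with a toy model and the gloo backend, where only the DDP-wrapped module participates in the join:

```python
import torch
import torch.distributed as dist
from torch.distributed.algorithms import Join
from torch.nn.parallel import DistributedDataParallel as DDP


def run(rank: int, world_size: int):
    # Standard process-group setup; assumes MASTER_ADDR/MASTER_PORT are set.
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    model = DDP(torch.nn.Linear(8, 2))
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    # Deliberately uneven inputs: rank 0 gets 5 batches, rank 1 gets 6, ...
    batches = [torch.randn(4, 8) for _ in range(5 + rank)]

    # Join shadows the collective calls of ranks that run out of data early,
    # so the ranks that still have batches left do not hang.
    with Join([model]):
        for batch in batches:
            loss = model(batch).sum()
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

    dist.destroy_process_group()
```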
Pitch
Provide Join in LightningLite, more specifically through the Strategy. The idea here is that once the user has added the join to their loop, they won't have to change it again when switching to a single-device strategy (it would simply be a no-op).
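A rough sketch of how this could look from the user's side; the self.join hook is purely hypothetical and only illustrates the proposal, while the rest follows the existing LightningLite API:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from pytorch_lightning.lite import LightningLite


class Loop(LightningLite):
    def run(self):
        module = torch.nn.Linear(32, 2)
        optimizer = torch.optim.SGD(module.parameters(), lr=0.1)
        model, optimizer = self.setup(module, optimizer)
        dataloader = self.setup_dataloaders(
            DataLoader(TensorDataset(torch.randn(100, 32)), batch_size=8)
        )

        # Hypothetical API: with a DDP strategy this would delegate to
        # torch.distributed.algorithms.Join; with a single-device strategy
        # it would return a no-op context manager, so the loop stays unchanged.
        with self.join(model):
            for (batch,) in dataloader:
                loss = model(batch).sum()
                self.backward(loss)
                optimizer.step()
                optimizer.zero_grad()


Loop(accelerator="cpu", devices=2, strategy="ddp").run()
```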
Note that, by default, Lite will auto-insert a DistributedSampler into the dataloader for the user. The tricky part here is that Join is only useful if you set drop_last=False in the sampler. How do we link the two features together so that they work in a meaningful way?
Alternatives
Do not introduce this. The user can just use the raw PyTorch APIs.
Additional context
Once this lands in Lite, the PL strategies can also make use of it in their implementations. This can be developed in parallel to #3325.
If you enjoy Lightning, check out our other projects! ⚡
Metrics: Machine learning metrics for distributed, scalable PyTorch applications.
Lite: Enables pure PyTorch users to scale their existing code on any kind of device while retaining full control over their own loops and optimization logic.
Flash: The fastest way to get a Lightning baseline! A collection of tasks for fast prototyping, baselining, fine-tuning, and solving problems with deep learning.
Bolts: Pretrained SOTA Deep Learning models, callbacks, and more for research and production with PyTorch Lightning and PyTorch.
Lightning Transformers: Flexible interface for high-performance research using SOTA Transformers leveraging PyTorch Lightning, Transformers, and Hydra.
cc @borda @carmocca @justusschock @awaelchli