Lightning-AI / pytorch-lightning

Pretrain, finetune ANY AI model of ANY size on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

Provide a backend agnostic Join for LightningLite #14635

Open awaelchli opened 2 years ago

awaelchli commented 2 years ago

🚀 Feature

Provide Join through an intuitive API in LightningLite and make it backend agnostic, i.e., switching from DDP to single-device and vice versa should not require changes to the code.

Motivation

The DDP Join context manager in PyTorch allows you to run your loops with a different number of items on each rank, without running into out-of-sync issues and hangs in collective calls. PyTorch calls this "uneven inputs". Normally, the DistributedSampler would "even out" the data on each rank by inserting fake, repeated data.
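
For reference, this is roughly what the raw PyTorch API looks like (a minimal sketch; it assumes the process group is already initialized, e.g. by torchrun, and uses a toy per-rank batch list in place of a real dataloader):

```python
import torch
import torch.distributed as dist
from torch.distributed.algorithms.join import Join
from torch.nn.parallel import DistributedDataParallel as DDP

# assumes torch.distributed is already initialized (e.g. torchrun + init_process_group)
model = DDP(torch.nn.Linear(8, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# stand-in for a dataloader whose length differs per rank ("uneven inputs")
per_rank_batches = [torch.randn(4, 8) for _ in range(dist.get_rank() + 1)]

# Join shadows the collectives of ranks that finish early, so the remaining
# ranks don't hang waiting for gradient all-reduces
with Join([model]):
    for batch in per_rank_batches:
        loss = model(batch).sum()
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```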

Pitch

Provide Join in LightningLite, more specifically through the Strategy. The idea is that once the user has added the join to their loop, they won't have to change it again when switching to a single-device strategy (where it would simply be a no-op).

Note that by default, Lite auto-inserts a DistributedSampler into the dataloader for the user. The tricky part is that Join is only useful if you set `drop_last=False` in the sampler. How do we link the two features together so that they work in a meaningful way?
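
To make the pitch concrete, here is a rough sketch of how this could look in a Lite script. The `self.join(...)` call is hypothetical and does not exist today; the rest uses the existing LightningLite API (import path may differ by version):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from pytorch_lightning.lite import LightningLite


class Lite(LightningLite):
    def run(self):
        model = torch.nn.Linear(8, 1)
        optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
        model, optimizer = self.setup(model, optimizer)

        # Lite auto-inserts a DistributedSampler here when running with DDP
        dataloader = self.setup_dataloaders(
            DataLoader(TensorDataset(torch.randn(100, 8)), batch_size=4)
        )

        # Proposed (hypothetical): under DDP this would wrap torch's Join;
        # under the single-device strategy it would be a no-op context manager
        with self.join(model):
            for (batch,) in dataloader:
                loss = model(batch).sum()
                self.backward(loss)
                optimizer.step()
                optimizer.zero_grad()


if __name__ == "__main__":
    Lite(accelerator="cpu", devices=2, strategy="ddp").run()
```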

Alternatives

Do not introduce this. The user can just use the raw PyTorch APIs.

Additional context

Once this lands in Lite, the PL strategies can also make use of it in their implementations. This can be developed in parallel to #3325.



cc @borda @carmocca @justusschock @awaelchli

justusschock commented 2 years ago

cc @otaj who was working on Join I think :)

carmocca commented 2 years ago

Is this proposal relevant to Join only? Or should we instead tackle it from the wider perspective of https://github.com/Lightning-AI/lightning/issues/7534?

Join would be DDP only, but this idea:

> they won't have to change it again when switching to a single-device strategy (where it would simply be a no-op).

should apply to all collective calls.

otaj commented 2 years ago

Oh, yes, this is a great idea! However, I gotta agree with @carmocca that it might be better to have this applied to all collective calls. Maybe even have something like a `lightning.lite.distributed` module/package which would contain all of these calls.
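
Purely as an illustration of that idea (none of these names exist; `lightning.lite.distributed` and the helpers below are hypothetical), such a module could expose collectives that degrade to no-ops when no process group is running:

```python
import torch
import torch.distributed as dist


def all_reduce(tensor: torch.Tensor) -> torch.Tensor:
    """Sum `tensor` across ranks; return it unchanged when not distributed."""
    if dist.is_available() and dist.is_initialized():
        dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
    return tensor


def barrier() -> None:
    """Synchronize ranks; a no-op when no process group is initialized."""
    if dist.is_available() and dist.is_initialized():
        dist.barrier()
```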

carmocca commented 2 years ago

Related: https://github.com/Lightning-AI/lightning/issues/13821 (DeepSpeed did something similar).