Open arjunagarwal899 opened 5 months ago
@lantiga @Borda @tchaton @awaelchli @justusschock Any feedback on this?
Thanks for the interest in this, @arjunagarwal899. It's not a bug because Lightning assumes this setting in several places, with good intention. The docs and examples never show any other setting, so it's definitely not expected to work. The change you propose in the PR is unfortunately not enough (it may be enough in your case). More places would need updates, including thorough testing and extending tests (feel free to give it a try). We won't have bandwidth to help with this in the near future, but if the community is willing to put the effort into it, we'd be happy to review PRs for it, of course!
Other issues: #19898, #14078
@awaelchli @arjunagarwal899 I am facing a similar issue. I do not have the capacity to write this PR, but I may be able to help test it.
Bug description
The default `LightningEnvironment` assumes that every node in a multi-node setup has an equal number of GPUs, i.e. each node computes the world size as the number of nodes multiplied by the number of (active) devices on that node. However, implementing one's own cluster environment can bypass this limitation (example attached below). While the processes get registered successfully, the `num_replicas` argument passed to the `DistributedSampler` class is still computed independently of the environment, which leads to an error because some ranks fall outside the world size.

Fix: use `num_replicas=self.world_size()` instead of estimating the world size again.

What version are you seeing the problem on?
v2.2
How to reproduce the bug
Config and custom Environment:
Trainer:
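(The original Trainer call is also elided. A plausible reconstruction, assuming the custom environment from the snippet above is passed via `plugins` and that this copy of the script runs on the 4-GPU node; `MyModel` and the device counts are placeholders.)

```python
import lightning as L

# On node 0 (4 GPUs); the copy running on node 1 would use devices=2.
trainer = L.Trainer(
    accelerator="gpu",
    strategy="ddp",
    num_nodes=2,
    devices=4,
    plugins=[MyClusterEnvironment()],  # the custom environment defined above
)
trainer.fit(MyModel())
</imports>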
Run on:
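(The exact launch commands are not preserved either; a hedged example of launching the script manually on two nodes, with hostname, port, and script name as placeholders:)

```shell
# Node 0 (4 GPUs)
MASTER_ADDR=node0.example.com MASTER_PORT=12345 NODE_RANK=0 python train.py

# Node 1 (2 GPUs)
MASTER_ADDR=node0.example.com MASTER_PORT=12345 NODE_RANK=1 python train.py
```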
Error messages and logs

Error on node 1:

`num_replicas` gets set to 4 here because `num_nodes=2` and `num_processes=2`. However, the world size is 6, as defined in the environment.

Environment
Current environment
```
* Lightning:
    - efficientnet-pytorch: 0.7.1
    - lightning: 2.2.5
    - lightning-cloud: 0.5.61
    - lightning-utilities: 0.10.0
    - pytorch-lightning: 2.1.2
    - pytorchvideo: 0.1.5
    - torch: 2.2.2
    - torchaudio: 2.2.2
    - torchmetrics: 1.2.1
    - torchsummary: 1.5.1
    - torchvision: 0.17.2
```

More info
The issue can be fixed by replacing `ddp.py:L137` with the fix described above.
cc @borda @awaelchli @justusschock