Lightning-AI / pytorch-lightning

Pretrain, finetune ANY AI model of ANY size on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

DDP-spawn on LSF cluster (multi-node) should not be supported #15103

Closed · awaelchli closed this issue 2 years ago

awaelchli commented 2 years ago

🚀 Bug

DDP-spawn on multi-node configurations where processes get spawned externally should not be supported.

To Reproduce

Running with

Trainer(strategy="ddp_spawn", num_nodes=2, devices=...)  # any num_nodes > 1

on an LSF cluster is currently faulty, and this configuration should not be supported.

Expected behavior

Should not be supported. Two options:

A) Raise an error and suggest using strategy="ddp".
B) Fall back to DDP.

A similar check already exists for SLURM:

https://github.com/Lightning-AI/lightning/blob/d2840a20bd27a0a86fc59ba54d94543e63523133/src/pytorch_lightning/trainer/connectors/accelerator_connector.py#L616-L619

I suggest going with B), since we already do this for the other environments.
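For reference, the linked SLURM check is roughly of the following shape (a paraphrased sketch; the exact condition at that revision may differ):

if strategy_flag == "ddp_spawn" and (
    TorchElasticEnvironment.detect() or KubeflowEnvironment.detect() or self._is_slurm_managing_tasks()
):
    strategy_flag = "ddp"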

Additional context

Originally reported on Slack.



carmocca commented 2 years ago

I think this fallback logic is a bit faulty. I would prefer to do (A), and even update the previous logic to also error out.
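For illustration, option (A) could look something like this (a minimal sketch; the exception type is the one Lightning already uses for misconfigurations, but the message and placement here are hypothetical):

from pytorch_lightning.utilities.exceptions import MisconfigurationException

if strategy_flag == "ddp_spawn" and LSFEnvironment.detect():
    # Erroring out instead of silently falling back, per option (A).
    raise MisconfigurationException(
        "strategy='ddp_spawn' is not supported when processes are launched externally,"
        " e.g. by LSF. Use strategy='ddp' instead."
    )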

Atharva-Phatak commented 2 years ago

@awaelchli and @carmocca Quick question: you want to fall back, i.e. option (B), correct? That is, raise a warning and set the strategy to DDP? Do you have a specific warning in mind? If you could share some details, I can write a fix and create a PR :)

I can take this up if this is open :)

awaelchli commented 2 years ago

@Atharva-Phatak The fallback logic is linked in the description above; LSFEnvironment needs to be added to those checks. That's the quick fix to align the logic with the other environments, where we fall back to "ddp".

I wouldn't say that the logic is faulty, just brittle and easy to forget to update when adding a new environment.

I would definitely go for the quick fix (fallback) first (for both Trainer and Lite). Then, the error handling can be investigated.

Atharva-Phatak commented 2 years ago

@awaelchli Essentially, right now we just set the strategy to DDP, right? So we need to implement an LSFEnvironment class, something like TorchElasticEnvironment, right?

awaelchli commented 2 years ago

There is nothing to implement, really. All we have to do is add the check, as I pointed out above: https://github.com/Lightning-AI/lightning/blob/d2840a20bd27a0a86fc59ba54d94543e63523133/src/pytorch_lightning/trainer/connectors/accelerator_connector.py#L616-L619

We just add or LSFEnvironment.detect() to that condition and extend the existing test.
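With that one-line change, the condition would read roughly like this (a sketch based on the permalink above; surrounding code elided):

if strategy_flag == "ddp_spawn" and (
    TorchElasticEnvironment.detect()
    or KubeflowEnvironment.detect()
    or LSFEnvironment.detect()
    or self._is_slurm_managing_tasks()
):
    strategy_flag = "ddp"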

Atharva-Phatak commented 2 years ago

@awaelchli Thanks for clearing that up. I thought LSFEnvironment was not implemented :p

Atharva-Phatak commented 2 years ago

@awaelchli I am adding the check as you mentioned. I also understand that I will have to @mock.patch.dict the environment variables. Could you verify these environment parameters and let me know if they are right? I am not sure about a couple of them. If you could help me out, that would be amazing.

import os
from unittest import mock

@mock.patch.dict(
    os.environ,
    {
        "LSB_JOBID": "1",
        "LSB_DJOB_RANKFILE": "",  # Not sure about this.
        "JSM_NAMESPACE_LOCAL_RANK": "1",
        "JSM_NAMESPACE_SIZE": "20",
        "JSM_NAMESPACE_RANK": "1",  # Not sure about this.
    },
)
awaelchli commented 2 years ago

Yes, looks ok. These are the env variables that need to be set such that LSFEnvironment.detect() returns True, which is useful for such a test.
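Put together, the test could look roughly like this (a sketch; the test name and assertion are illustrative, and whether an empty LSB_DJOB_RANKFILE survives Trainer construction is exactly what running it will tell you):

import os
from unittest import mock

from pytorch_lightning import Trainer
from pytorch_lightning.strategies import DDPStrategy

@mock.patch.dict(
    os.environ,
    {
        "LSB_JOBID": "1",
        "LSB_DJOB_RANKFILE": "",  # may need to point to a real rank file
        "JSM_NAMESPACE_LOCAL_RANK": "1",
        "JSM_NAMESPACE_SIZE": "20",
        "JSM_NAMESPACE_RANK": "1",
    },
)
def test_ddp_spawn_falls_back_to_ddp_on_lsf():
    # If the fallback is in place, requesting ddp_spawn under a detected
    # LSF environment should resolve to the regular DDP strategy.
    trainer = Trainer(strategy="ddp_spawn", num_nodes=2, accelerator="cpu", devices=1)
    assert isinstance(trainer.strategy, DDPStrategy)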

Atharva-Phatak commented 2 years ago

@awaelchli Perfect, so it's okay if I keep LSB_DJOB_RANKFILE as ""?

awaelchli commented 2 years ago

I encourage you to study the code, write the test according to your expectations, and run it to see what happens. In this instance, the best way to learn is to do it :)

If it doesn't work, I recommend opening the PR with what you have, and we will help out from there :)