Closed awaelchli closed 2 years ago
I think this fall-back logic is a bit faulty. I would prefer to do (A) and even update the previous logic to also error-out.
@awaelchli and @carmocca Quick question: you want to fall back to option (A), correct? I.e., raise a warning and set the strategy to DDP, right? Do you have any specific warning in mind? If you could share some details, I can write a fix and create a PR :)
I can take this up if this is open :)
@Atharva-Phatak I show the fallback logic in the description above; LSFEnvironment needs to be added to those checks. That's the quick fix to align the logic with the other environments where we fall back to "ddp".
I wouldn't say that the logic is faulty, just brittle, and easy to miss updating when adding a new environment.
I would definitely go for the quick fix (fallback) first (for both Trainer and Lite). Then, the error handling can be investigated.
@awaelchli Essentially, right now we just set the strategy to DDP, right? So we need to implement an LSFEnvironment class, something like TorchElasticEnvironment, right?
There is nothing to implement really. All we have to do is add the check here as I pointed out above: https://github.com/Lightning-AI/lightning/blob/d2840a20bd27a0a86fc59ba54d94543e63523133/src/pytorch_lightning/trainer/connectors/accelerator_connector.py#L616-L619
We just add "or LSFEnvironment.detect()" to that condition and extend the existing test.
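To make the quick fix concrete, here is a simplified sketch of the fallback being discussed (this is illustrative, not Lightning's actual code; the real condition lives in accelerator_connector.py at the link above, and the detect() stand-ins below are hypothetical placeholders for the real SLURMEnvironment/LSFEnvironment classes):

```python
# Sketch of the "ddp_spawn" -> "ddp" fallback, assuming detect() helpers
# that report whether an external cluster launcher is running the job.
def resolve_strategy(requested: str, environ: dict) -> str:
    """Return the effective strategy flag, falling back from spawn to ddp."""

    def slurm_detected() -> bool:
        # stand-in for SLURMEnvironment.detect()
        return "SLURM_NTASKS" in environ

    def lsf_detected() -> bool:
        # stand-in for LSFEnvironment.detect(); the real check inspects
        # several LSB_*/JSM_* variables
        return "LSB_JOBID" in environ and "LSB_DJOB_RANKFILE" in environ

    if requested == "ddp_spawn" and (slurm_detected() or lsf_detected()):
        # in Lightning, a rank_zero_warn(...) would accompany this fallback
        return "ddp"
    return requested
```

The point of the fallback: when the cluster launcher has already spawned the processes externally, spawning again inside the Trainer makes no sense, so "ddp" is the only sensible choice.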
@awaelchli Thanks for clearing that up. I thought LSFEnvironment was not implemented :p
@awaelchli I am adding the check as you mentioned. I also understand that I would have to use @mock.patch.dict to patch the environment variables.
Could you verify these environment variables and let me know if they are right? I am not sure about a couple of them. If you could help me out, that would be amazing.
@mock.patch.dict(
    os.environ,
    {
        "LSB_JOBID": "1",
        "LSB_DJOB_RANKFILE": "",  # Not sure about this.
        "JSM_NAMESPACE_LOCAL_RANK": "1",
        "JSM_NAMESPACE_SIZE": "20",
        "JSM_NAMESPACE_RANK": "1",  # Not sure about this.
    },
)
Yes, looks ok. These are the env variables that need to be set such that LSFEnvironment.detect() returns True, which is useful for such a test.
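For reference, a minimal sketch of how such a test could look, assuming detect() simply checks that the required variables are present in os.environ (the real LSFEnvironment may validate more, e.g. read the rank file, and the exact variable set is an assumption here; fake_lsf_detect is a hypothetical stand-in, not Lightning's API):

```python
import os
from unittest import mock

# Assumed set of required variables; the actual LSFEnvironment may differ.
REQUIRED_VARS = {
    "LSB_JOBID",
    "LSB_DJOB_RANKFILE",
    "JSM_NAMESPACE_LOCAL_RANK",
    "JSM_NAMESPACE_SIZE",
    "JSM_NAMESPACE_RANK",
}

def fake_lsf_detect() -> bool:
    # stand-in for LSFEnvironment.detect(): True when all vars are set
    return REQUIRED_VARS.issubset(os.environ)

@mock.patch.dict(
    os.environ,
    {
        "LSB_JOBID": "1",
        "LSB_DJOB_RANKFILE": "/tmp/rankfile",  # a real file path in practice
        "JSM_NAMESPACE_LOCAL_RANK": "1",
        "JSM_NAMESPACE_SIZE": "20",
        "JSM_NAMESPACE_RANK": "1",
    },
)
def test_lsf_env_detected():
    # inside the patched context, detection should succeed
    assert fake_lsf_detect()
```

mock.patch.dict restores os.environ after the decorated function returns, so the test leaves no state behind.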
@awaelchli Perfect, so it's okay if I keep LSB_DJOB_RANKFILE as ""?
I encourage you to try to study the code, write the test according to your expectations, and run it to see what happens. In this instance, the best way to learn is to do it :)
If it doesn't work, I recommend opening the PR with what you have and we will help out from there :)
🚀 Bug
DDP-spawn on multi-node configurations where processes get spawned externally should not be supported.
To Reproduce
Running with DDP-spawn on LSF is currently faulty, but it should not be supported at all.
Expected behavior
Should not be supported. Two options:
A) Raise an error and suggest to use strategy="ddp".
B) Fall back to DDP. A similar check already exists for SLURM: https://github.com/Lightning-AI/lightning/blob/d2840a20bd27a0a86fc59ba54d94543e63523133/src/pytorch_lightning/trainer/connectors/accelerator_connector.py#L616-L619
I suggest going with B), since we already do the same for the other environments.
Additional context
Originally reported on Slack.