Open ashleve opened 2 years ago
Hey @ashleve,
The code in Lightning to enable DDP with Hydra is there: https://github.com/PyTorchLightning/pytorch-lightning/blob/7fa1aebcc99297e4d7eb8dcf2deb22e6da814edf/pytorch_lightning/strategies/ddp.py#L231.
Would you be interested in investigating this behavior with sweep and creating a bug fix PR if you manage to solve it?
Otherwise, would you mind providing a reproducible script with the BoringModel + Hydra ?
Alternatively, did you try using `Trainer(strategy="ddp_spawn")`?
Best, T.C
@tchaton Hey, here's a minimal example:
https://github.com/ashleve/lit-hydra-ddp-multirun-bug
Run multirun with python main.py -m x=0,1
I was not able to find an easy fix, but here's what I found:

Seems like each DDP process causes hydra to spawn an extra multirun directory, e.g. `16-27-40` and `16-27-42`. Both of those were generated by a single multirun, but there should be only one main folder with multiple subfolders named by job number: `0`, `1`, `2`...

I suspect the cause might be an incorrect ckpt path. Removing `dirpath` in the checkpoint callback makes `trainer.test()` execute without issues, but the extra folders still remain.

I brought it up in #2727. What has mostly been working for me is to change the lines @tchaton mentioned above to:
```python
if _HYDRA_AVAILABLE:
    if HydraConfig.initialized():
        cwd = get_original_cwd()
        os_cwd = f'"{os.getcwd()}"'  # this is needed to handle characters like `=` in the directory name
        command = command[:2]
        command += ["-cp", str(Path.cwd().relative_to(Path(cwd)) / ".hydra"), "-cn", "config.yaml"]
        command += [f"hydra.run.dir={os_cwd}", f"hydra.job.name=train_ddp_process_{local_rank}"]
```
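For anyone who wants to poke at this logic outside of Lightning's source, here is a stand-alone sketch of the same command rewrite. The function name and signature are mine for illustration; inside Lightning, `command` comes from `sys.argv` and `local_rank` from the DDP strategy:

```python
# Hypothetical stand-alone version of the patch above (not Lightning's API).
import os
from pathlib import Path


def build_ddp_command(command: list, orig_cwd: str, local_rank: int) -> list:
    # Quote the cwd so characters like `=` in Hydra's timestamped dirs survive.
    os_cwd = f'"{os.getcwd()}"'
    cmd = command[:2]  # keep only the interpreter and the script
    # Point the child process at the parent's already-resolved config in `.hydra/`.
    cmd += ["-cp", str(Path.cwd().relative_to(Path(orig_cwd)) / ".hydra"), "-cn", "config.yaml"]
    # Pin the child's run dir to the parent's instead of letting Hydra create a new one.
    cmd += [f"hydra.run.dir={os_cwd}", f"hydra.job.name=train_ddp_process_{local_rank}"]
    return cmd
```

The key effect is that each DDP worker reuses the parent's run directory rather than asking Hydra to materialize a fresh timestamped one.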
Dear @jgbos,
Would you be willing to make a PR to resolve this bug?
@tchaton I'll think about it. I'm running into some issues with multirun that I need to debug, which brings me to the actual challenge of this PR... is there a good way to test Hydra+PL? It's still a bit hacky to assume the `config.yaml` file is in the `.hydra` directory (which is the default behavior). Maybe do a search with `Path.glob` for the config file? Again, I'll think about it.
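A hedged sketch of that `Path.glob` idea (the helper name and layout are my own, not from Lightning or Hydra): instead of hard-coding `.hydra`, walk the run directory for the resolved config:

```python
# Hypothetical helper: locate Hydra's resolved config without assuming the
# default `.hydra` output_subdir.
from pathlib import Path
from typing import Optional


def find_hydra_config(run_dir: str, name: str = "config.yaml") -> Optional[Path]:
    # rglob walks subdirectories, so a customized `hydra.output_subdir` is still found;
    # sorting makes the choice deterministic if several matches exist.
    matches = sorted(Path(run_dir).rglob(name))
    return matches[0] if matches else None
```

An edge case this does not resolve: a user's own file named `config.yaml` elsewhere in the run dir would also match, which is part of why testing this properly is tricky.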
FYI, if someone wants to fix this before I get to it, I think this code should work (needs testing though):
```python
if _HYDRA_AVAILABLE:
    if HydraConfig.initialized():
        orig_cwd = get_original_cwd()
        cwd = os.getcwd()
        os_cwd = f'"{cwd}"'  # this is needed to handle characters like `=` in the directory name
        hydra_cfg = HydraConfig.get()
        hydra_output = os.path.join(os.path.relpath(cwd, orig_cwd), hydra_cfg.output_subdir)
        command = command_no_args  # command[:2], or command[:3] if using "python -m script.main"
        command += ["-cp", hydra_output, "-cn", "config.yaml"]
        command += [f"hydra.run.dir={os_cwd}", f"hydra.job.name=train_ddp_process_{local_rank}"]
```
Hey @jgbos,
Great question. I think the simplest approach is to create a small test where the config file is not in the expected place and see if you can recover from it.
Best, T.C
@tchaton True, I'm also thinking about which PL+Hydra properties to test. There are going to be a few edge cases my solution doesn't support. I would rather PL support something like `python -m pytorch_lightning.main pickled_data.pk` for launching its subprocesses. That way the subprocess would work independently of how a user launches a job. Not sure how that would be implemented, though. Tough problem to solve nicely.
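To make the proposal concrete, here is a minimal sketch of that launch scheme, assuming a hypothetical entry point (the module name in the comment and the helper below are illustrations, not Lightning's API): the parent pickles everything the child needs, so the relaunch no longer depends on `sys.argv` at all.

```python
# Hypothetical sketch: launch a subprocess from pickled state instead of sys.argv.
import os
import pickle
import subprocess
import sys
import tempfile


def launch_child(payload: dict) -> str:
    """Pickle `payload`, start a child that unpickles it, and return its stdout."""
    fd, path = tempfile.mkstemp(suffix=".pkl")
    with os.fdopen(fd, "wb") as f:
        pickle.dump(payload, f)
    # A real entry point might be `python -m pytorch_lightning.main <path>`;
    # here the child is inlined with -c to keep the sketch self-contained.
    child_src = (
        "import pickle, sys\n"
        "with open(sys.argv[1], 'rb') as f:\n"
        "    data = pickle.load(f)\n"
        "print(data['job_name'])\n"
    )
    try:
        result = subprocess.run(
            [sys.executable, "-c", child_src, path],
            capture_output=True, text=True, check=True,
        )
    finally:
        os.unlink(path)
    return result.stdout.strip()
```

The design point is that the child reconstructs its state from the pickle, so it behaves the same whether the parent was launched by Hydra, plain `python`, or anything else.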
Yes, this needs to be revisited :) Feel free to open a PR with your solution, so we can iterate on your findings.
hi @ashleve - thanks for creating the minimal repro! that was really helpful.
Sounds like there are two issues here:

1. Hydra changes the current working directory for each run.
2. `hydra.sweep.dir` got created twice somehow in ddp mode.

As for 1, in Hydra 1.2 (the one we are currently working on), we added an option to not change the current working dir. If you run your application with `hydra.job.chdir=False`, it should work. We've recently put out a dev release of Hydra 1.2; you can install it with `pip install hydra-core --pre --upgrade` in case you want to give that a try.
@jgbos Will #11617 fix this issue?
@carmocca Yes, it should fix this issue.
Hope this issue will be fixed
@OZOOOOOH sorry, I've been swamped on other projects to finish the PR. You can checkout a current solution we have here: https://mit-ll-responsible-ai.github.io/responsible-ai-toolbox/how_to/hydraddp.html.
We had to revert part of this change in #15737. I'm reopening the issue so that we can continue to look for a solution that does not involve changing the current working directory when launching the processes.
I've been working to come up with a solution for this (and more generally). A robust solution would remove Lightning's need to depend on `sys.argv` when launching subprocesses. I think the following is possibly a robust solution using the `submitit` package; could someone else give it a try and let me know how it goes?
```python
from typing import Optional

import pytorch_lightning as pl
import submitit
from submitit.core.core import Executor


class SubmititTrainer:
    def __init__(self, executor: Optional[Executor] = None, devices: int = 1, **kwargs):
        """PyTorch Lightning Trainer wrapped by Submitit.

        This class does not inherit `Trainer` because I want to support DDP in the notebook.

        Parameters
        ----------
        executor: Executor | None (default: None)
            The submitit executor; if `None`, a default local executor is used.
        devices: int = 1
            The devices for `pl.Trainer`
        **kwargs: Any
            Arguments for `pl.Trainer`
        """
        self.kwargs = kwargs
        self.devices = devices
        if executor is None:
            executor = submitit.AutoExecutor("submitit_log", cluster="local")
            executor.update_parameters(gpus_per_node=devices)
        self.executor = executor

    def _init_trainer(self):
        return pl.Trainer(devices=self.devices, **self.kwargs)

    def _fit(self, *args, **kwargs):
        trainer = self._init_trainer()
        return trainer.fit(*args, **kwargs)

    def _test(self, *args, **kwargs):
        trainer = self._init_trainer()
        return trainer.test(*args, **kwargs)

    def _predict(self, *args, **kwargs):
        trainer = self._init_trainer()
        return trainer.predict(*args, **kwargs)

    def fit(self, *args, **kwargs):
        return self.executor.submit(self._fit, *args, **kwargs)

    def test(self, *args, **kwargs):
        return self.executor.submit(self._test, *args, **kwargs)

    def predict(self, *args, **kwargs):
        return self.executor.submit(self._predict, *args, **kwargs)
```
```python
from rai_toolbox.mushin.testing.lightning import SimpleLightningModule, SimpleDataModule

dm = SimpleDataModule()
model = SimpleLightningModule()
trainer = SubmititTrainer(strategy="ddp", accelerator="gpu", devices=2, max_epochs=1)
job = trainer.test(model=model, datamodule=dm)
print(f"Started Local Job: {job.job_id}")
```
🐛 Bug
Running DDP with Hydra multirun ends up with a "Killed" error message when launching the second task. I experience this ONLY when passing the `dirpath` parameter to the checkpoint callback.

Tested for lightning v1.5.7. I believe this issue wasn't around in one of the previous releases.

This probably has something to do with the way hydra changes the working directory for each new run - the directory for storing checkpoints also gets changed. If I remember correctly, there was some workaround implemented in lightning which made DDP possible despite that.
cc @tchaton @rohitgr7 @justusschock @kaushikb11 @awaelchli @akihironitta