jieru-hu opened this issue 2 years ago
@jgbos thank you for taking the time to improve Hydra + PL. Sharing some early thoughts here...
I don't think there's much downside to dumping an extra yaml file in the output dir.
One downside is that resolving could be expensive if the user employs custom resolvers that do heavy computation. See my comment here.
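To illustrate the concern (my example, not from the thread): a user-registered resolver whose evaluation is expensive is exactly the case where eagerly resolving the whole config just to dump a resolved yaml would add noticeable overhead. The resolver name and function below are hypothetical.

```python
import time
from omegaconf import OmegaConf


def expensive_lookup(dataset: str) -> str:
    # stand-in for a slow operation, e.g. a network call or scanning a large file
    time.sleep(5)
    return f"stats-for-{dataset}"


# referenced from a config as e.g. `dataset_stats: ${expensive:imagenet}`
OmegaConf.register_new_resolver("expensive", expensive_lookup)
```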
I've been quite busy lately, so I haven't had much time to address the PR. It does look like PL is updating their approach to allow for easy custom launching. Yet, I think this should be even easier for Hydra users. Since we have configs before execution, a user can create a pl_main.py in their repo with:
```python
import hydra
from hydra.utils import instantiate
from omegaconf import DictConfig


def task(cfg: DictConfig) -> None:
    trainer = instantiate(cfg.trainer)
    module = instantiate(cfg.module)
    if cfg._ddp_fitting:
        trainer.fit(module)
    else:
        trainer.test(module)


@hydra.main(config_path=None, config_name="config")
def main(cfg: DictConfig) -> None:
    task(cfg)


if __name__ == "__main__":
    main()
```
and the PL command to launch each DDP subprocess should just be:
```
python -m my_repo.pl_main -cn config.yaml -cp <path to config.yaml> +_ddp_fitting=TrainerFn.FITTING hydra.run.dir=<cwd>
```
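For context, a rough sketch of what `config.yaml` could look like for the `pl_main.py` above; the `_target_` classes and field values here are hypothetical placeholders, not something from the PR (note `_ddp_fitting` is appended via the `+_ddp_fitting=...` CLI override shown above rather than stored in the file):

```yaml
trainer:
  _target_: pytorch_lightning.Trainer
  max_epochs: 10
  accelerator: gpu
  devices: 2
  strategy: ddp
module:
  _target_: my_repo.module.MyLightningModule  # hypothetical module class
  lr: 1.0e-3
```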
What are your thoughts on this?
thanks @jgbos for taking a look :)
I'm not exactly sure what the suggestion is here (probably due to my lack of context in this area). Could we chat in Zulip or meet and discuss this?
thanks @jgbos @Jasha10 for taking a look. I've added some more thoughts on MULTIRUN above. Pls take a look when you get a chance.
> I think one potential solution is: in PL, we detect that Hydra is in MULTIRUN mode (with the help of #394), then we can override the PL command for the subprocess to launch the exact same job (with the help of #1805)
So #394 means that we will be able to detect multirun mode by inspecting `HydraConfig.get()`, right? Overall the plan sounds good.
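For concreteness, a minimal sketch of the multirun check being discussed, assuming a Hydra version where `HydraConfig` exposes the run mode (which is what #394 tracks):

```python
from hydra.core.hydra_config import HydraConfig
from hydra.types import RunMode


def is_multirun() -> bool:
    # HydraConfig.get() is only available inside a running Hydra application
    return HydraConfig.get().mode == RunMode.MULTIRUN
```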
Does the following example make sense to you @jieru-hu?
- User invokes `python script.py abc/pqr=xyz foo=0,1 --multirun`
- This spawns two multirun jobs (one with `foo=0` and one with `foo=1`). For this example, let's say the working directories for these jobs are `./multirun/.../0` and `./multirun/.../1`.
- Job 0 should save a resolved copy of the job config at `./multirun/.../0/.hydra/resolved.yaml`, and similarly for job 1 (a rough sketch of this is below).
- In job 0, PL should launch the subprocess using some solution from #1805, pointing Hydra to the `./multirun/.../0/.hydra` directory so that an identical config can be produced. PL does the same thing for Job 1, referencing `./multirun/.../1/.hydra`.
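A rough sketch of what "save a resolved copy of the job config" could look like from inside the task function, assuming Hydra exposes `hydra.runtime.output_dir` (Hydra >= 1.2); the `resolved.yaml` name just follows the proposal above:

```python
import os

from hydra.core.hydra_config import HydraConfig
from omegaconf import DictConfig, OmegaConf


def save_resolved_config(cfg: DictConfig) -> None:
    # resolve all interpolations now so every DDP worker later sees identical values
    resolved = OmegaConf.create(OmegaConf.to_container(cfg, resolve=True))
    out_dir = os.path.join(HydraConfig.get().runtime.output_dir, ".hydra")
    os.makedirs(out_dir, exist_ok=True)
    OmegaConf.save(resolved, os.path.join(out_dir, "resolved.yaml"))
```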
exactly :)
@jieru-hu apologies for the slow response. I've been swamped. Hopefully have more time in the next week. I'll let you know if you still want to chat. As for my comment, it's related to the PL solution with Hydra, not a Hydra solution specifically. It would definitely be interesting to see what a Hydra solution would be.
sure, sounds good!
So I'm trying to revive https://github.com/PyTorchLightning/pytorch-lightning/pull/11617. Where can I find more info on the result of #1805?
Edit: nvm, I found the re-run documentation.
As of 2023/09/09, running a sweep/multirun of multiple PL training jobs with DDP still results in newly spawned DDP processes (`train_ddp_process_1`, etc.) creating multiple output dirs for the same multirun, with multiple copies of multirun.yaml:
```yaml
...
job:
  name: train_ddp_process_1  # and there are copies of multirun.yaml corresponding to `train_ddp_process_2` and so on
  chdir: null
  override_dirname: sweep=ddp_test,tags=ddp_test,trainer.max_epochs=2
  id: ???
  num: ???
  config_name: train.yaml
  env_set: {}
  env_copy: []
...
```
my hydra config:
```yaml
...
run:
  dir: ${paths.log_dir}/${job_name}/runs/${now:%Y-%m-%d}_${now:%H-%M-%S}_${tags}
sweep:
  dir: ${paths.log_dir}/${job_name}/multiruns/${now:%Y-%m-%d}_${now:%H-%M-%S}_${tags}
  # Sanitize override_dirname by replacing '/' with '.' to avoid unintended subdirectory creation
  subdir: ${eval:'"${hydra.job.override_dirname}".replace("/", ".")'}
job_logging:
  handlers:
    file:
      filename: ${hydra.runtime.output_dir}/job.log
...
```
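One thing worth noting (my addition, not part of the original comment): `${eval:...}` is not a built-in OmegaConf/Hydra resolver, so a config like this assumes the resolver is registered before `@hydra.main` runs, roughly like:

```python
from omegaconf import OmegaConf

# register the `eval` resolver used by the `hydra.sweep.subdir` entry above;
# guard against double registration if this module can be imported more than once
OmegaConf.register_new_resolver("eval", eval)
```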
@jieru-hu I'm willing to look into it. Can you give me some advice?
Turns out the problem is on the Lightning side (https://github.com/Lightning-AI/lightning/blob/1bee50b31d78c66f1182571a2c022356625a8fbb/src/lightning/fabric/strategies/launchers/subprocess_script.py#L175). Lightning fixes `hydra.run.dir` for all the subprocesses it launches for Hydra, but that setting is only used in single-run mode, not in sweep/multirun, which uses `hydra.sweep.dir` and `hydra.sweep.subdir` to configure the output dir. I think changing `hydra.run.dir` to `hydra.runtime.output_dir` can partially fix this issue, but that creates a minor problem of the subprocess `multirun.yaml` overwriting the `multirun.yaml` created by the main process.
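To make the run vs. sweep distinction concrete, a small sketch (assuming Hydra >= 1.2) of how the per-job output directory is derived in each mode:

```python
from hydra.core.hydra_config import HydraConfig
from hydra.types import RunMode

hc = HydraConfig.get()
if hc.mode == RunMode.RUN:
    # single run: output goes to hydra.run.dir
    job_dir = hc.run.dir
else:
    # multirun/sweep: output goes to hydra.sweep.dir / hydra.sweep.subdir
    job_dir = f"{hc.sweep.dir}/{hc.sweep.subdir}"
# hydra.runtime.output_dir resolves to the per-job output directory in either mode
job_dir_from_runtime = hc.runtime.output_dir
```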
I could create a PR for this. Do you have any other suggestions? @jieru-hu
Thanks for looking into it -- note that @jieru-hu may not be involved anymore with Hydra.
I don't have a good enough understanding of what's happening exactly to provide suggestions, but feel free to submit a PR if you feel like you have a potential solution!
Thanks for the info. It would actually be a PR on the Lightning side. Right now my monkey-patching solution is as follows:
```python
import sys

# the patched function is assigned onto these submodules below, so import them explicitly
import lightning.fabric.strategies.launchers.subprocess_script
import lightning.pytorch.strategies.launchers.subprocess_script


def _hydra_subprocess_cmd(local_rank: int):
    """Monkey patch for lightning.fabric.strategies.launchers.subprocess_script._hydra_subprocess_cmd.

    Temporarily fixes the problem of unnecessarily creating log folders for DDP
    subprocesses in Hydra multirun/sweep.
    """
    import __main__  # local import to avoid https://github.com/Lightning-AI/lightning/issues/15218

    from hydra.core.hydra_config import HydraConfig
    from hydra.types import RunMode
    from hydra.utils import get_original_cwd, to_absolute_path

    # when the user is using Hydra, find the absolute path to the script
    if __main__.__spec__ is None:  # pragma: no-cover
        command = [sys.executable, to_absolute_path(sys.argv[0])]
    else:
        command = [sys.executable, "-m", __main__.__spec__.name]

    command += sys.argv[1:] + [f"hydra.job.name=train_ddp_process_{local_rank}", "hydra.output_subdir=null"]

    cwd = get_original_cwd()
    config = HydraConfig.get()
    if config.mode == RunMode.RUN:
        command += [f'hydra.run.dir="{config.run.dir}"']
    elif config.mode == RunMode.MULTIRUN:
        command += [f'hydra.sweep.dir="{config.sweep.dir}"']
    print(command)
    return command, cwd


lightning.fabric.strategies.launchers.subprocess_script._hydra_subprocess_cmd = _hydra_subprocess_cmd
lightning.pytorch.strategies.launchers.subprocess_script._hydra_subprocess_cmd = _hydra_subprocess_cmd
```
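Usage-wise, the patch just needs to be applied before the Trainer spawns its DDP subprocesses; a hypothetical entry point (the module and config names below are placeholders) could look like:

```python
import hydra
from omegaconf import DictConfig

# importing the module that contains the monkey patch above applies it
import my_project.hydra_ddp_patch  # hypothetical module name


@hydra.main(config_path="configs", config_name="train")
def main(cfg: DictConfig) -> None:
    ...  # build the Trainer and call fit() as usual


if __name__ == "__main__":
    main()
```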
This solution pretty much fixes the problem of DDP subprocesses creating unwanted job log folders when running a sweep/multirun with Lightning and DDP training, but one small problem remains, as I previously suspected: DDP subprocesses still create multirun.yaml, and the last subprocess's multirun.yaml will overwrite the multirun.yaml created by the main process. It would be nice to allow disabling saving multirun.yaml in subprocesses like DDP (see #2341, #1773, #1937). @omry
I'll update once I create a PR for Lightning.
Update: Joined the discussion on an existing issue for Lightning: https://github.com/Lightning-AI/lightning/pull/18175.
I tried to solve this a while ago and even got a PR merged, but apparently it broke some other things and got reverted. You can take a look at the code here: https://github.com/Lightning-AI/lightning/pull/11617
Hey, thanks for the heads-up. I took a look at the code and the back-and-forths throughout version changes. I agree that "the current method of spawning using sys.argv is not robust". For now, my monkey patch works as a temporary fix for my project, but I'd like to contribute to a better solution if I can. We shall see if we get an update from the creator of https://github.com/Lightning-AI/lightning/pull/18175.
Any update? Is the issue still present?
Hello! Any hacky way of getting ddp and hydra multirun to work? Maybe exiting ddp processes in a particular way? Looking forward to this being solved! (hydra is awesome!)
Some issues I observed from trying out DDP + Hydra + PL (mostly on a single node with multiple GPUs). By no means is this a comprehensive list of existing issues. I want to use this for some initial discussions and see how we can improve from Hydra's side.
Hydra Single run
Sort of works, with rough edges. PL has to hack around the fact that Hydra changes the working dir at job run time (which can be avoided starting with Hydra 1.2): https://github.com/PyTorchLightning/pytorch-lightning/blob/2d043857eef55eb327528b70c8b33c680e82fb7b/pytorch_lightning/strategies/launchers/subprocess_script.py#L140-L145
This is definitely not ideal: there's no guarantee that the jobs launched in different processes are identical - which I believe is an assumption DDP holds. One of the root causes is that Hydra's configs are resolved lazily, so a resolver like this would be resolved differently when running the same Hydra application command in a separate process.
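As a side illustration (mine, not part of the original issue): a time-dependent resolver produces whatever the time is when the interpolation is resolved, so a DDP subprocess that re-composes and re-resolves the same config a few seconds after rank 0 can end up with, e.g., a different output dir.

```python
import time

from omegaconf import OmegaConf

# a time-based resolver, registered here without caching to mimic what a fresh
# process sees: each (re)resolution yields the value at that moment
OmegaConf.register_new_resolver("ts", lambda fmt: time.strftime(fmt))
cfg = OmegaConf.create({"output_dir": "outputs/${ts:%H-%M-%S}"})

print(cfg.output_dir)  # e.g. outputs/12-00-01
time.sleep(2)
print(cfg.output_dir)  # e.g. outputs/12-00-03 -- a process resolving later gets a different dir
```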
For more context: running a simple PL+Hydra application in SINGLE RUN mode with ddp strategy on 2 GPUs will yield outputs like this (notice that two sets of output dirs were created, which is not the expected behavior).
The `train.log` contains logs from rank 0 while `train_ddp_process_1.log` has the logs from the rank 1 worker (these log files are created by Hydra and can be configured via `hydra/output`). IMO the ideal output structure should be that everything on the same node is recorded in the same outputs dir and all the nodes log to `train.log`.
To address this, I think we can make some improvements on both sides:
- For PL: we can add a version check for Hydra - if Hydra >1.2, we should urge users to run the Hydra app with `hydra.run.chdir=False` so that PL does not need to do any custom overrides (we can then get rid of this line).
- For Hydra: if we can provide the exact composed config from Hydra, PL can modify the Hydra application command and pass along the composed config. This will ensure that all configurations are the same across all DDP workers.
Hydra Multi-run
Multirun is more problematic here, and @jgbos has provided more context in this PR
I think one potential solution is: in PL, we detect that Hydra is in MULTIRUN mode (with the help of #394), then we can override the PL command for the subprocess to launch the exact same job (with the help of #1805).
Launching multiple jobs will be taken care of by Hydra; running the same job on multiple nodes will be taken care of by PL. This seems like a cleaner split.
List of related existing issues