facebookresearch / hydra

Hydra is a framework for elegantly configuring complex applications
https://hydra.cc
MIT License

Improving Hydra + PL + DDP experience #2070

Open jieru-hu opened 2 years ago

jieru-hu commented 2 years ago

Some issues I observed from trying out DDP + Hydra + PL (mostly on a single node with multiple GPUs). By no means is this a comprehensive list of existing issues; I want to use it for some initial discussion and to see how we can improve things from Hydra's side.

Hydra Single run

It sort of works, with rough edges. PL has to hack around the fact that Hydra changes the working dir at job run time (which can be avoided starting with Hydra 1.2): https://github.com/PyTorchLightning/pytorch-lightning/blob/2d043857eef55eb327528b70c8b33c680e82fb7b/pytorch_lightning/strategies/launchers/subprocess_script.py#L140-L145

This is definitely not ideal: there's no guarantee that the jobs launched in different processes are identical, which I believe is an assumption DDP relies on. One of the root causes is that Hydra's configs are resolved lazily, so resolvers like this would resolve differently when the same Hydra application command is run in a separate process.
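
To make the lazy-resolution point concrete, here is a minimal sketch (the config and the stand-in resolver below are illustrative, not Hydra internals):

from datetime import datetime

from omegaconf import OmegaConf

# Hydra registers its own `now` resolver; a stand-in is registered here just to
# show that interpolations are evaluated at access time, not at composition time.
OmegaConf.register_new_resolver("now", lambda fmt: datetime.now().strftime(fmt), replace=True)

cfg = OmegaConf.create({"out_dir": "outputs/${now:%H-%M-%S}"})
print(cfg.out_dir)  # e.g. outputs/11-39-58 in the main process

# A DDP subprocess that re-composes the same config a moment later resolves
# ${now:...} again and gets e.g. outputs/11-39-59, hence the second output dir.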

For more context: running a simple PL + Hydra application in SINGLE RUN mode with the ddp strategy on 2 GPUs yields output like this (notice that two sets of output dirs are created, which is not the expected behavior):

# the dir/logs created from the main process
outputs
└── 2022-03-04
    └── 11-39-58
        ├── .hydra
        │   ├── config.yaml
        │   ├── hydra.yaml
        │   └── overrides.yaml
        └── train.log

# The output dir/logs created from running the same Hydra commands in the subprocess for DDP
.hydra
├── config.yaml
├── hydra.yaml
└── overrides.yaml
train_ddp_process_1.log

train.log contains the logs from the rank 0 worker, while train_ddp_process_1.log has the logs from the rank 1 worker. (These log files are created by Hydra and can be configured via hydra/output.)

IMO the ideal output structure would be for everything on the same node to be recorded in the same outputs dir, with all workers logging to train.log.

To address this, I think we can make some improvements on both sides:

For PL: we can add a version check for Hydra. If Hydra >= 1.2, we should urge users to run the Hydra app with hydra.job.chdir=False so that PL does not need to do any custom overrides (we can then get rid of this line); a rough sketch of such a check is shown below.

For Hydra: if we can provide the exact composed config, PL can modify the Hydra application command and pass the composed config along. This will ensure that the configuration is the same across all DDP workers.
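
A rough sketch of what that PL-side version check could look like (this is not actual PL code; the exact warning text and where it would live in PL's launcher are assumptions):

import warnings

import hydra
from packaging.version import Version


def warn_if_chdir_needed() -> None:
    # Hydra >= 1.2 can keep the original working directory, so PL would not need
    # its custom hydra.run.dir override if users opt out of the chdir behavior.
    if Version(hydra.__version__) >= Version("1.2.0"):
        warnings.warn(
            "Detected Hydra >= 1.2 with DDP: consider launching with "
            "'hydra.job.chdir=False' so the working directory is not changed."
        )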

Hydra Multi-run

Multirun is more problematic here, and @jgbos has provided more context in this PR

I think one potential solution is: in PL, we detect that Hydra is in MULTIRUN mode (with the help of #394), then we override the PL command for the subprocess to launch the exact same job (with the help of #1805):

# get the .hydra/resolved.yaml config from hydra.runtime.output_dir
job_config = HydraConfig.get().runtime.output_dir + "/.hydra/resolved.yaml"
python train.py --config job_config 

Launching multiple jobs would be taken care of by Hydra; running the same job on multiple nodes would be taken care of by PL. This seems like a cleaner split.

List of related existing issues

jieru-hu commented 2 years ago

@jgbos thank you for taking the time to improve Hydra + PL. Sharing some early thoughts here...

Jasha10 commented 2 years ago

I don't think there's much downside of dumping an extra yaml file in the output dir.

One downside is that resolving could be expensive if the user employs custom resolvers that do heavy computation. See my comment here.

jgbos commented 2 years ago

I've been quite busy lately, so I haven't had much time to address the PR. It does look like PL is updating their approach to allow for easy custom launching. Still, I think this should be even easier for Hydra users. Since we have the configs before execution, a user can create a pl_main.py in their repo with:

import hydra
from hydra.utils import instantiate
from omegaconf import DictConfig


def task(cfg: DictConfig) -> None:
    trainer = instantiate(cfg.trainer)
    module = instantiate(cfg.module)

    # _ddp_fitting is appended on the command line by the DDP launcher (see below)
    if cfg._ddp_fitting:
        trainer.fit(module)
    else:
        trainer.test(module)


@hydra.main(config_path=None, config_name="config")
def main(cfg: DictConfig) -> None:
    task(cfg)


if __name__ == "__main__":
    main()

and the PL command to launch each DDP subprocess should just be:

python -m my_repo.pl_main -cn config.yaml -cp <path to config.yaml> +_ddp_fitting=TrainerFn.FITTING hydra.run.dir=<cwd>

What are your thoughts on this?

jieru-hu commented 2 years ago

thanks @jgbos for taking a look :)

I'm not exactly sure what the suggestion here is (probably due to my lack of context in this area). Could we chat on Zulip or meet to discuss this?

jieru-hu commented 2 years ago

thanks @jgbos @Jasha10 for taking a look. I've added some more thoughts on MULTIRUN above. Pls take a look when you get a chance.

Jasha10 commented 2 years ago

Let me cross-link to my comment here on @jgbos's PR.

Jasha10 commented 2 years ago

I think one potential solution is: in PL, we detect that Hydra is in MULTIRUN mode (with the help of https://github.com/facebookresearch/hydra/issues/394), then we override the PL command for the subprocess to launch the exact same job (with the help of https://github.com/facebookresearch/hydra/issues/1805)

So #394 means that we will be able to detect multirun mode by inspecting HydraConfig.get(), right?
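
For concreteness, a minimal sketch of that check (this assumes the run mode ends up exposed on HydraConfig, e.g. as a mode field of type RunMode):

from hydra.core.hydra_config import HydraConfig
from hydra.types import RunMode


def launched_in_multirun() -> bool:
    # Only meaningful inside a running Hydra app, where HydraConfig is populated.
    return HydraConfig.get().mode == RunMode.MULTIRUN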

Overall the plan sounds good.

Does the following example make sense to you @jieru-hu?

jgbos commented 2 years ago

@jieru-hu apologies for the slow response; I've been swamped. Hopefully I'll have more time in the next week. I'll let you know, if you still want to chat. As for my comment, it's related to the PL solution with Hydra, not a Hydra-specific solution. It would definitely be interesting to see what a Hydra solution would look like.

jieru-hu commented 2 years ago

I think one potential solution is: in PL, we detect that Hydra is in MULTIRUN mode (with the help of #394), then we override the PL command for the subprocess to launch the exact same job (with the help of #1805)

So #394 means that we will be able to detect multirun mode by inspecting HydraConfig.get(), right?

Overall the plan sounds good.

Does the following example make sense to you @jieru-hu?

  • User invokes python script.py abc/pqr=xyz foo=0,1 --multirun
  • This spawns two multirun jobs (one with foo=0 and one with foo=1). For this example, let's say the working directories for these jobs are ./multirun/.../0 and ./multirun/.../1
  • Job 0 should save a resolved copy of the job config at ./multirun/.../0/.hydra/resolved.yaml, and similarly for job 1.
  • In job 0, PL should launch the subprocess using some solution from #1805, pointing Hydra to the ./multirun/.../0/.hydra directory so that an identical config can be produced. PL does the same thing for Job 1, referencing ./multirun/.../1/.hydra.

exactly :)

jieru-hu commented 2 years ago

@jieru-hu apologies for the slow response; I've been swamped. Hopefully I'll have more time in the next week. I'll let you know, if you still want to chat. As for my comment, it's related to the PL solution with Hydra, not a Hydra-specific solution. It would definitely be interesting to see what a Hydra solution would look like.

sure, sounds good!

jgbos commented 2 years ago

So I'm trying to revive https://github.com/PyTorchLightning/pytorch-lightning/pull/11617. Where can I find more info on the result of #1805?

edit

nvm. I found the re-run documentation.

libokj commented 1 year ago

As of 2023/09/09, running a sweep/multirun of multiple PL training jobs with DDP still results in the newly spawned DDP processes (train_ddp_process_1, etc.) creating multiple output dirs for the same multirun, each with its own copy of multirun.yaml:

...
  job:
    name: train_ddp_process_1 # and there are copies of multirun.yaml corresponding to `train_ddp_process_2` and so on
    chdir: null
    override_dirname: sweep=ddp_test,tags=ddp_test,trainer.max_epochs=2
    id: ???
    num: ???
    config_name: train.yaml
    env_set: {}
    env_copy: []
...

my hydra config:

...
run:
  dir: ${paths.log_dir}/${job_name}/runs/${now:%Y-%m-%d}_${now:%H-%M-%S}_${tags}
sweep:
  dir: ${paths.log_dir}/${job_name}/multiruns/${now:%Y-%m-%d}_${now:%H-%M-%S}_${tags}
  # Sanitize override_dirname by replacing '/' with '.' to avoid unintended subdirectory creation
  subdir: ${eval:'"${hydra.job.override_dirname}".replace("/", ".")'}

job_logging:
  handlers:
    file:
      filename: ${hydra.runtime.output_dir}/job.log
...
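
(Note: ${eval:...} is not a built-in resolver; a config like the one above assumes something along these lines was registered before @hydra.main runs:)

from omegaconf import OmegaConf

# Hypothetical registration of the custom `eval` resolver used in the config above.
OmegaConf.register_new_resolver("eval", eval)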

@jieru-hu I'm willing to look into it. Can you give me some advice?

libokj commented 1 year ago

Turns out the problem is on the Lightning side (https://github.com/Lightning-AI/lightning/blob/1bee50b31d78c66f1182571a2c022356625a8fbb/src/lightning/fabric/strategies/launchers/subprocess_script.py#L175). Lightning pins hydra.run.dir for every subprocess it launches when Hydra is in use, but that setting is only used in single-run mode, not in sweep/multirun, which uses hydra.sweep.dir and hydra.sweep.subdir to configure the output_dir. I think changing hydra.run.dir to hydra.runtime.output_dir can partially fix this issue, but that creates a minor problem: each subprocess's multirun.yaml overwrites the multirun.yaml created by the main process.
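
For concreteness, the change I have in mind would look roughly like this (a sketch, not the actual Lightning code; it would replace the part of _hydra_subprocess_cmd that currently pins hydra.run.dir):

from hydra.core.hydra_config import HydraConfig


def _output_dir_override() -> list:
    # Reuse the already-resolved output dir of the current job so the DDP
    # subprocess writes into the same place instead of creating a new one.
    output_dir = HydraConfig.get().runtime.output_dir
    return [f'hydra.run.dir="{output_dir}"']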

I could create a PR for this. Do you have any other suggestions? @jieru-hu

odelalleau commented 1 year ago

Thanks for looking into it -- note that @jieru-hu may not be involved anymore with Hydra.

I don't have a good enough understanding of what's happening exactly to provide suggestions, but feel free to submit a PR if you feel like you have a potential solution!

libokj commented 1 year ago

Thanks for looking into it -- note that @jieru-hu may not be involved anymore with Hydra.

I don't have a good enough understanding of what's happening exactly to provide suggestions, but feel free to submit a PR if you feel like you have a potential solution!

Thanks for the info. It would actually be a PR on the Lightning side. Right now my monkey-patching solution is as follows:

import sys

import lightning.fabric.strategies.launchers.subprocess_script
import lightning.pytorch.strategies.launchers.subprocess_script


def _hydra_subprocess_cmd(local_rank: int):
    """
    Monkey patch for lightning.fabric.strategies.launchers.subprocess_script._hydra_subprocess_cmd
    Temporarily fixes the problem of unnecessarily creating log folders for DDP subprocesses in Hydra multirun/sweep.
    """
    import __main__  # local import to avoid https://github.com/Lightning-AI/lightning/issues/15218
    from hydra.core.hydra_config import HydraConfig
    from hydra.types import RunMode
    from hydra.utils import get_original_cwd, to_absolute_path

    # when the user is using Hydra, find the absolute path to the script
    if __main__.__spec__ is None:  # pragma: no-cover
        command = [sys.executable, to_absolute_path(sys.argv[0])]
    else:
        command = [sys.executable, "-m", __main__.__spec__.name]

    command += sys.argv[1:] + [f"hydra.job.name=train_ddp_process_{local_rank}", "hydra.output_subdir=null"]

    cwd = get_original_cwd()
    config = HydraConfig.get()

    # reuse the run/sweep dir of the main process instead of letting the subprocess create a new one
    if config.mode == RunMode.RUN:
        command += [f'hydra.run.dir="{config.run.dir}"']
    elif config.mode == RunMode.MULTIRUN:
        command += [f'hydra.sweep.dir="{config.sweep.dir}"']

    print(command)

    return command, cwd


lightning.fabric.strategies.launchers.subprocess_script._hydra_subprocess_cmd = _hydra_subprocess_cmd
lightning.pytorch.strategies.launchers.subprocess_script._hydra_subprocess_cmd = _hydra_subprocess_cmd

This solution pretty much fixes the problem of DDP subprocesses creating unwanted job log folders when running a sweep/multirun with Lightning and DDP training, but one small problem remains, as I previously suspected: DDP subprocesses still create multirun.yaml, and the last subprocess's multirun.yaml overwrites the multirun.yaml created by the main process. It would be nice to allow disabling the saving of multirun.yaml in subprocesses like DDP (see #2341, #1773, #1937). @omry

I'll update once I create a PR for Lightning.

Update: Joined the discussion on an existing issue for Lightning: https://github.com/Lightning-AI/lightning/pull/18175.

jgbos commented 1 year ago

I tried to solve this a while ago and even got a PR merged, but apparently it broke some other things and got reverted. You can take a look at the code here: https://github.com/Lightning-AI/lightning/pull/11617

libokj commented 1 year ago

I tried to solve this a while ago and even got a PR merged, but apparently it broke some other things and got reverted. You can take a look at the code here: Lightning-AI/lightning#11617

Hey, thanks for the heads-up. I took a look at the code and the back-and-forth across the version changes. I agree that "the current method of spawning using sys.argv is not robust". For now, my monkey patch works as a temporary fix for my project, but I'd like to contribute to a better solution if I can. We shall see if we get an update from the creator of https://github.com/Lightning-AI/lightning/pull/18175.

mfoglio commented 8 months ago

Any update? Is the issue still present?

konstantinos-pitas-cognex commented 15 hours ago

Hello! Any hacky way of getting ddp and hydra multirun to work? Maybe exiting ddp processes in a particular way? Looking forward to this being solved! (hydra is awesome!)