facebookresearch / hydra

Hydra is a framework for elegantly configuring complex applications
https://hydra.cc
MIT License

[Feature Request] `exclude_keys` in `override_dirname` should have regex/wildcard/glob(*) option #1873

Open thoth291 opened 2 years ago

thoth291 commented 2 years ago

🚀 Feature Request

Suppose I pass app.verbose and app.config on the command line and want to exclude them from override_dirname. I would like to be able to write something like the following in the override_dirname config:

# configfile options which I'm trying to `multirun`
params:
    a: 4
    name: test
app:
    verbose: false
    config: false
    setup: true
hydra:
  run:
    dir: ${hydra.job.name}/single/${now:%Y-%m-%d}/${now:%H-%M-%S}
  sweep:
    dir: ${hydra.job.name} #_${hydra.job.id}
    subdir: ${hydra.job.num}_${hydra.job.override_dirname}
  job:
    name: ${params.name}
    override_dirname: ???
    config:
      # configuration for the ${hydra.job.override_dirname} runtime variable
      override_dirname:
        kv_sep: '='
        item_sep: ','
        # currently I'm using [app.verbose,app.config,app.setup]
        exclude_keys: [glob(app.*)]
# commandline arguments I would pass
simple_app params.a=4,5,6 app.verbose=true -m

With this config, the command above would give override_dirname="params.a=4" for one of the sweep tasks.
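To make the request concrete, here is a rough Python model of how the directory name could be assembled from a job's overrides. It is only a sketch, not Hydra's actual code: the helper names `override_key` and `build_override_dirname` are made up for illustration, and it assumes that exclusion is an exact match on the override key and that the remaining overrides are sorted and joined with `item_sep`.

```python
from typing import Iterable, List


def override_key(override: str) -> str:
    """Return the config key an override targets, e.g. '+a.b=1' -> 'a.b'."""
    return override.lstrip("+~").split("=", 1)[0]


def build_override_dirname(
    overrides: Iterable[str],
    exclude_keys: List[str],
    kv_sep: str = "=",
    item_sep: str = ",",
) -> str:
    """Drop overrides whose key is excluded, then sort and join the rest."""
    kept = sorted(o for o in overrides if override_key(o) not in exclude_keys)
    return item_sep.join(kept).replace("=", kv_sep)


task_overrides = ["params.a=4", "app.verbose=True"]
# Today every key has to be listed explicitly:
excludes = ["app.verbose", "app.config", "app.setup"]
print(build_override_dirname(task_overrides, excludes))  # params.a=4
```

With a glob such as app.* the explicit list would not need to be updated every time a new key is added under app.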

Motivation

This would simplify directory naming for runs and would allow better integration with command line functionality.

Is your feature request related to a problem? Please describe.

It's related to issue #1872, which also uses the command line as the primary driver for the sweep setup. Together, these two issues would bring the command-line functionality in line with the YAML-file experience users currently have.

Pitch

Describe the solution you'd like

The solution would allow excluding keys based on more generic criteria, which would keep the configuration robust as the app's config evolves.

Describe alternatives you've considered

The only alternative is not to use override_dirname and to configure the directory name using custom interpolation and logic. But that would limit generalizability, since the user would have to modify this logic each time a new argument is considered for sweeping.

Are you willing to open a pull request? no

Additional context

I will take any workaround that only requires modifying the main config.yaml file once, for all possible future sweeps, or any custom Python logic I need to write into my app.

Thanks in Advance!

Jasha10 commented 2 years ago

Hi @thoth291,

Here is a workaround that hopefully will work for you. It involves registering an OmegaConf custom resolver to calculate the set of keys represented by the syntax glob(app.*) that you have proposed:

# simple_app.py
import os
from typing import List
from hydra.core.hydra_config import HydraConfig
import hydra
from omegaconf import DictConfig, OmegaConf

def my_glob_impl(pattern: str, _root_: DictConfig) -> List[str]:
    """
    A simple glob implementation, takes a `pattern` with wildcard `*` at the
    end.  The return value is a list of full keys in the config which match the
    `pattern`.
    """
    assert pattern.endswith(".*")
    pattern = pattern.removesuffix(".*")
    node = OmegaConf.select(_root_, key=pattern)
    if node is None:
        return []
    if not isinstance(node, DictConfig):
        raise NotImplementedError
    else:
        return [f"{pattern}.{key}" for key in node.keys()]

OmegaConf.register_new_resolver(name="my_glob", resolver=my_glob_impl)

@hydra.main(config_path=".", config_name="config")
def my_app(_cfg: DictConfig) -> None:
    exclude_keys = HydraConfig.get().job.config.override_dirname.exclude_keys
    print(f"{exclude_keys=}")
    print(f"Working dir: {os.getcwd()}")

if __name__ == "__main__":
    my_app()

# config.yaml
params:
    a: 4
    name: test
app:
    verbose: false
    config: false
    setup: true

hydra:
  run:
    dir: ${hydra.job.name}/single/${now:%Y-%m-%d}/${now:%H-%M-%S}
  sweep:
    dir: ${hydra.job.name} #_${hydra.job.id}
    subdir: ${hydra.job.num}_${hydra.job.override_dirname}
  job:
    name: ${params.name}
    override_dirname: ???
    config:
      # configuration for the ${hydra.job.override_dirname} runtime variable
      override_dirname:
        kv_sep: '='
        item_sep: ','
        # currently I'm using [app.verbose,app.config,app.setup]
        exclude_keys: '${my_glob: app.*}'

$ # at the command line:
$ python3 simple_app.py params.a=4,5,6 app.verbose=true -m
[2021-11-03 19:40:55,849][HYDRA] Launching 3 jobs locally
[2021-11-03 19:40:55,849][HYDRA]        #0 : params.a=4 app.verbose=True
exclude_keys=['app.verbose', 'app.config', 'app.setup']
Working dir: /Users/jasha10/hydra_tmp/tmp1873/test/0_params.a=4
[2021-11-03 19:40:55,941][HYDRA]        #1 : params.a=5 app.verbose=True
exclude_keys=['app.verbose', 'app.config', 'app.setup']
Working dir: /Users/jasha10/hydra_tmp/tmp1873/test/1_params.a=5
[2021-11-03 19:40:56,054][HYDRA]        #2 : params.a=6 app.verbose=True
exclude_keys=['app.verbose', 'app.config', 'app.setup']
Working dir: /Users/jasha10/hydra_tmp/tmp1873/test/2_params.a=6

The basic idea is to use the my_glob_impl function to create the list of keys that you want to exclude. If you use this my_glob resolver in your Hydra config, then the _root_ argument passed to my_glob_impl will be the root config that is composed by Hydra. The glob functionality could be made more advanced by e.g. supporting wildcards in multiple places or allowing some recursive expansion (e.g. glob app.** to get all nested keys).

Edit: I've reversed the order of the if not isinstance(node, DictConfig) and the if node is None blocks in the my_glob_impl function body. The if node is None check should come first, as otherwise that if-block will never be reached.
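As a sketch of the more advanced globbing mentioned above, the snippet below supports wildcards anywhere in the key by flattening the config into dotted full keys and filtering them with Python's `fnmatch`. The names `iter_full_keys` and `glob_keys` are hypothetical, and the functions work on a plain nested dict so they can be tried without Hydra. Inside a resolver one would presumably convert `_root_` first, e.g. with `OmegaConf.to_container(_root_, resolve=False)`. Note that `fnmatch`'s `*` also matches dots, so `app.*` here already covers nested keys.

```python
from fnmatch import fnmatchcase
from typing import Any, Iterator, List, Mapping


def iter_full_keys(node: Mapping[str, Any], prefix: str = "") -> Iterator[str]:
    """Yield the dotted full key of every leaf in a nested mapping."""
    for key, value in node.items():
        full_key = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, Mapping):
            yield from iter_full_keys(value, full_key)  # descend into sub-dicts
        else:
            yield full_key


def glob_keys(pattern: str, config: Mapping[str, Any]) -> List[str]:
    """Return every leaf key matching a shell-style pattern (case-sensitive)."""
    return [key for key in iter_full_keys(config) if fnmatchcase(key, pattern)]


config = {
    "params": {"a": 4, "name": "test"},
    "app": {"verbose": False, "config": False, "setup": True},
}
print(glob_keys("app.*", config))   # ['app.verbose', 'app.config', 'app.setup']
print(glob_keys("*.name", config))  # ['params.name']
```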

thoth291 commented 2 years ago

@Jasha10 , thank you for your suggestion! It works perfectly fine. One minor question is left - I can't get my head around how to add to that list. For example:

simple_app \
         '+sweep={name:t4,a:4},{name:z50,a:50}' \
         params.name='${sweep.name}' params.a='${sweep.a}' \
         app.verbose=true \
         -m

Then my exclude_keys should contain not only the glob for app.* but also sweep as a key to be excluded. The question is how to combine the dynamic list from '${my_glob: app.*}' with the static list [sweep]? Do I need to write yet another resolver, or is there already a resolver available for list merging?

Jasha10 commented 2 years ago

You certainly could write yet another resolver:

# simple_app.py
OmegaConf.register_new_resolver(name="concat", resolver=lambda *lists: [elt for l in lists for elt in l])

# config.yaml
        exclude_keys: '${concat: ${my_glob: app.*}, [sweep]}'
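For reference, the concat lambda is an ordinary list flattener. Here is the same logic as a named function, run on plain Python lists so it can be checked without OmegaConf (the sample `glob_result` stands in for whatever my_glob would return):

```python
from typing import Iterable, List


def concat(*lists: Iterable[str]) -> List[str]:
    """Flatten any number of lists into a single list, preserving order."""
    return [elt for lst in lists for elt in lst]


glob_result = ["app.verbose", "app.config", "app.setup"]  # stand-in for my_glob's output
print(concat(glob_result, ["sweep"]))
# ['app.verbose', 'app.config', 'app.setup', 'sweep']
```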

Another option would be to modify the my_glob_impl function above to take a list of patterns:

def multi_glob_impl(patterns: list[str], _root_: DictConfig) -> List[str]:
    """
    Like `my_glob_impl`, with two differences:
      - takes a list of `patterns` instead of one pattern
      - it is allowed for patterns to not end with a wildcard `.*` in which
        case no globbing is performed.
    """
    ret = []
    for pattern in patterns:
        if pattern.endswith(".*"):
            pattern = pattern.removesuffix(".*")
            node = OmegaConf.select(_root_, key=pattern)
            if node is None:
                continue
            if not isinstance(node, DictConfig):
                raise NotImplementedError(type(node))
            else:
                ret += [f"{pattern}.{key}" for key in node.keys()]
        else:
            ret.append(pattern)
    return ret

OmegaConf.register_new_resolver(name="multi_glob", resolver=multi_glob_impl)

# config.yaml
        exclude_keys: '${multi_glob: [app.*, sweep]}'

However, neither of these options allows you to extend exclude_keys from the command line. To do that, let's suppose you have a top-level dict called my_excludes in your config:

# config.yaml
params:
    a: 4
    name: test
app:
    verbose: false
    config: false
    setup: true

my_excludes:
  app: app.*
  sweep: sweep
  self: keys_to_exclude

hydra:
  run:
    dir: ${hydra.job.name}/single/${now:%Y-%m-%d}/${now:%H-%M-%S}
  sweep:
    dir: ${hydra.job.name} #_${hydra.job.id}
    subdir: ${hydra.job.num}_${hydra.job.override_dirname}
  job:
    name: ${params.name}
    override_dirname: ???
    config:
      # configuration for the ${hydra.job.override_dirname} runtime variable
      override_dirname:
        kv_sep: '='
        item_sep: ','
        # currently I'm using [app.verbose,app.config,app.setup]
        exclude_keys: '${multi_glob: ${oc.dict.values: my_excludes}}'

Above we are using OmegaConf's built-in oc.dict.values resolver to get a list of values from the top-level my_excludes mapping. These values are then passed to the multi_glob resolver to generate the exclude_keys list.

So the default excludes will be app.*, sweep, and keys_to_exclude. You can override this from the command-line as follows:

$ python3 simple_app.py params.a=4,5,6 app.verbose=true '~my_excludes.app' '+my_excludes.p=params.*' -m
[2021-11-05 11:30:02,198][HYDRA] Launching 3 jobs locally
[2021-11-05 11:30:02,198][HYDRA]        #0 : params.a=4 app.verbose=True ~my_excludes.app=null +my_excludes.p=params.*
exclude_keys=['sweep', 'keys_to_exclude', 'params.a', 'params.name']
Working dir: /home/jasha10/hydra_tmp/tmp1873/test/0_+my_excludes.p=params.*,app.verbose=True,~my_excludes.app=null
[2021-11-05 11:30:02,289][HYDRA]        #1 : params.a=5 app.verbose=True ~my_excludes.app=null +my_excludes.p=params.*
exclude_keys=['sweep', 'keys_to_exclude', 'params.a', 'params.name']
Working dir: /home/jasha10/hydra_tmp/tmp1873/test/1_+my_excludes.p=params.*,app.verbose=True,~my_excludes.app=null
[2021-11-05 11:30:02,390][HYDRA]        #2 : params.a=6 app.verbose=True ~my_excludes.app=null +my_excludes.p=params.*
exclude_keys=['sweep', 'keys_to_exclude', 'params.a', 'params.name']
Working dir: /home/jasha10/hydra_tmp/tmp1873/test/2_+my_excludes.p=params.*,app.verbose=True,~my_excludes.app=null

As you can see, with the '~my_excludes.app' command-line override, we remove "app.*" from the list of excludes, and with +my_excludes.p=params.* we are adding the glob "params.*" to the set of excludes.

The motivation for having the top-level my_excludes be a dict instead of a list is that it is easier to manipulate a dict using the command-line syntax. The values of the my_excludes dict are the important part (as the values of my_excludes are what gets passed to multi_glob_impl).
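To illustrate why a dict is convenient here, the sketch below emulates the two command-line overrides on a plain Python dict. This is not Hydra code: `apply_overrides` is a hypothetical helper, and the comments only indicate which override each step stands in for.

```python
from typing import Dict, Iterable, List, Optional


def apply_overrides(
    excludes: Dict[str, str],
    delete: Iterable[str] = (),
    add: Optional[Dict[str, str]] = None,
) -> List[str]:
    """Return the list of patterns left after deleting and adding dict entries."""
    updated = dict(excludes)       # work on a copy, leave the defaults untouched
    for key in delete:             # stands in for '~my_excludes.<key>'
        del updated[key]
    updated.update(add or {})      # stands in for '+my_excludes.<key>=<value>'
    return list(updated.values())  # the values are what multi_glob receives


defaults = {"app": "app.*", "sweep": "sweep", "self": "keys_to_exclude"}
print(apply_overrides(defaults, delete=["app"], add={"p": "params.*"}))
# ['sweep', 'keys_to_exclude', 'params.*']
```

Each entry can be addressed by its key, so removing or adding one pattern does not require restating the whole list.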

Jasha10 commented 2 years ago

Looking back at the above, using a regex pattern will give the most flexibility when deciding which overrides to exclude. This would require implementing your own logic to construct the directory name based on the list of overrides that are used for the current job. You can access the list of overrides in ${hydra.overrides.task}.

Here's what I have in mind, using regex patterns to see whether each override should be excluded:

import hydra
import os
from omegaconf import OmegaConf, ListConfig

def my_subdir_suffix_impl(
    task_overrides: ListConfig,  # list[str]: overrides passed at command line
    exclude_patterns: ListConfig,  # list[str]: regex patterns to exclude
) -> str:
    """Return a string: concatenation of overrides that are not matched by any of the `exclude_patterns`."""
    import re

    rets: list[str] = []
    for override in task_overrides:
        should_exclude = any(
            re.search(exc_pat, override) for exc_pat in exclude_patterns
        )
        if not should_exclude:
            rets.append(override)

    return "_".join(rets)

OmegaConf.register_new_resolver("my_subdir_suffix", my_subdir_suffix_impl)

@hydra.main(config_path=".", config_name="config")
def main(cfg):
    print(f"{os.getcwd()=}")

main()

# config.yaml
params:
    a: 4
    name: test
app:
    verbose: false
    config: false
    setup: true

my_excludes:
  app: app.*
  sweep: sweep
  self: my_excludes

hydra:
  sweep:
    dir: ${hydra.job.name} #_${hydra.job.id}
    subdir: "${hydra.job.num}_${my_subdir_suffix: ${hydra.overrides.task}, ${oc.dict.values:my_excludes}}"

$ python3 simple_app.py params.a=4,5,6 app.verbose=true '~my_excludes.app' '+my_excludes.p="params.*"' -m
[2021-11-07 11:25:35,072][HYDRA] Launching 3 jobs locally
[2021-11-07 11:25:35,072][HYDRA]        #0 : params.a=4 app.verbose=True ~my_excludes.app=null +my_excludes.p="params.*"
os.getcwd()='/home/jasha10/hydra_tmp/tmp1873/simple_app/0_app.verbose=True'
[2021-11-07 11:25:35,161][HYDRA]        #1 : params.a=5 app.verbose=True ~my_excludes.app=null +my_excludes.p="params.*"
os.getcwd()='/home/jasha10/hydra_tmp/tmp1873/simple_app/1_app.verbose=True'
[2021-11-07 11:25:35,259][HYDRA]        #2 : params.a=6 app.verbose=True ~my_excludes.app=null +my_excludes.p="params.*"
os.getcwd()='/home/jasha10/hydra_tmp/tmp1873/simple_app/2_app.verbose=True'
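One caveat with re.search is that it matches anywhere in the override string, so a pattern like sweep would also exclude params.name=${sweep.name}, and app.* would also match happy=1. A possible refinement, sketched below, is to match each pattern against the override's key only, using re.fullmatch. The helper names are hypothetical, and the function takes plain lists so that it can be tried without Hydra.

```python
import re
from typing import Iterable, List


def override_key(override: str) -> str:
    """Return the config key an override targets, e.g. '+a.b=1' -> 'a.b'."""
    return override.lstrip("+~").split("=", 1)[0]


def subdir_suffix(task_overrides: Iterable[str], exclude_patterns: List[str]) -> str:
    """Join the overrides whose key is not fully matched by any pattern."""
    kept = [
        override
        for override in task_overrides
        if not any(re.fullmatch(pat, override_key(override)) for pat in exclude_patterns)
    ]
    return "_".join(kept)


overrides = [
    "params.a=4",
    "app.verbose=True",
    "params.name=${sweep.name}",
    "+sweep={name:t4,a:4}",
]
# The dot is escaped so that `app\..*` matches keys under `app` only.
print(subdir_suffix(overrides, [r"app\..*", "sweep"]))
# params.a=4_params.name=${sweep.name}
```

Because the whole key has to match, a pattern meant to cover nested keys needs to say so explicitly, e.g. `my_excludes(\..*)?` rather than `my_excludes`.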

thoth291 commented 2 years ago

So many options now! This is really great. I ended up using the concat method for the moment, but will re-investigate later, once I have a few other people look at it. A couple of things are weird to me:

- you are using my_excludes.p - where is p defined?
- in the config file you are using self: my_excludes - how is that working?

I also never used oc.dict.values and hydra.overrides.task before - they are very neat things - which I will shamelessly steal from you ;-).

Jasha10 commented 2 years ago

> I also never used oc.dict.values and hydra.overrides.task before - they are very neat things - which I will shamelessly steal from you ;-).

Haha, good!

> you are using my_excludes.p - where is p defined?

I am using one of the techniques from the Modifying the Config Object section of the docs.

At the command line, I typed '+my_excludes.p="params.*"'. The plus symbol + takes care of adding the key "p" to my_excludes. If the plus were left out then Hydra would fail with an error.

I used the plus here to demonstrate how you can dynamically add keys to my_excludes using the command line. In this particular example, typing '+my_excludes.p="params.*"' at the command line prevents override keys starting with "params." from appearing in the output directory name. Meanwhile, my use of a tilde ~ in the override '~my_excludes.app' demonstrates how to delete a key from my_excludes at the CLI.

> in the config file you are using self: my_excludes - how is that working?

The word "self" is not special here; I could have used e.g. foobar: my_excludes instead.

What matters is that the string "my_excludes" shows up as one of the values in the cfg.my_excludes DictConfig. This prevents the word "my_excludes" from appearing in the name of the output directory. For example, you can try deleting the self: my_excludes line from the config file to see what the output directory name is.

# With `self: my_excludes` deleted from the config:
$ python3 simple_app.py '~my_excludes.app' -m
...
os.getcwd()='/home/jbss/hydra_tmp/tmp1873/simple_app/0_~my_excludes.app=null'

# With `self: my_excludes` included in the config:
$ python3 simple_app.py '~my_excludes.app' -m
...
os.getcwd()='/home/jbss/hydra_tmp/tmp1873/simple_app/0_'
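
The two results above can be reproduced without Hydra by running the same re.search filtering on plain lists. This is a minimal sketch: `suffix_by_search` is a hypothetical stand-in for my_subdir_suffix_impl, and the override string is assumed to be the one shown in the HYDRA log lines earlier in the thread.

```python
import re
from typing import List


def suffix_by_search(task_overrides: List[str], exclude_patterns: List[str]) -> str:
    """Same filtering idea as my_subdir_suffix_impl, applied to plain lists."""
    kept = [
        override
        for override in task_overrides
        if not any(re.search(pat, override) for pat in exclude_patterns)
    ]
    return "_".join(kept)


overrides = ["~my_excludes.app=null"]  # assumed form, taken from the logs above

# `self: my_excludes` deleted: nothing matches, so the override stays in the name.
print(repr(suffix_by_search(overrides, ["sweep"])))                 # '~my_excludes.app=null'
# `self: my_excludes` present: the override matches and is dropped.
print(repr(suffix_by_search(overrides, ["sweep", "my_excludes"])))  # ''
```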