facebookresearch / hydra

Hydra is a framework for elegantly configuring complex applications
https://hydra.cc
MIT License
8.66k stars 623 forks source link

[Feature Request] Support loading YAML from user-specified URI #2455

Open addisonklinke opened 1 year ago

addisonklinke commented 1 year ago

🚀 Feature Request

Allow user-specified URI for @hydra.main(config_path=...) and the CLI --config-path parameters

Motivation

Is your feature request related to a problem? Please describe.

Currently the above parameters must be a local directory containing the YAML files. However, when reproducing an ML experiment, the .hydra/config.yaml from the original run is often in a remote location (S3, MLflow tracking server, etc). It would be convenient to specify the config path with a non-local URI such as s3://bucket/prefix/ or runs:/run_id/artifact/path

Pitch

Describe the solution you'd like

One approach - add a load_cfg_func parameter to @hydra.main. The default treats the argument as a local filepath, but users could override to customize the behavior based on their URI syntax. This way Hydra's code doesn't have to support all the possible storage locations

The user function would be required to return a local filepath for Hydra to reference. Likely this would involve automating the download and storing in a temporary location

Describe alternatives you've considered

  1. Manually login (browser or SDK) to the remote where the YAMLs are stored and download a local copy. This is the current workaround, but has a couple downsides
    1. Not as time-efficient
    2. Tougher to automate for CI/CD and deployment pipelines. Currently, you'd need a separate process to launch before Hydra to download the YAML(s) whereas with this FR, everything could be combined into the primary script
  2. Use a hook from the Callbacks API. Unfortunately all the available options come after the config is composed, so they would not help with this use-case. Additionally, this doesn't quite feel right for a callback

Are you willing to open a pull request? (See CONTRIBUTING)

Yes, with guidance

Jasha10 commented 1 year ago

I don't think it makes sense for Hydra to get into the business of querying remote servers to get configs. There's too much surface area, and it's unlikely that Hydra would be able to provide a one-size-fits-all solution.

Would it be possible to download the config via python before invoking Hydra?

from s3_api import download_from_remote_location
...

download_from_remote_location(
    "s3://bucket/.hydra/config.yaml",
    save_to="downloaded_cfg.yaml",
)

@hydra.main(config_name="downloaded_cfg.yaml", config_path=".")
def my_app(cfg):
    ...
addisonklinke commented 1 year ago

I could create a separate download function, but the issue is supporting dynamic config locations. I'd want the option to change s3://bucket/.hydra/config.yaml from the CLI which means I'd need to use sys.argv hacks to determine the argument to download_from_remote_location(). So similar to https://github.com/facebookresearch/hydra/issues/2459, it can be accomplished as-is, but doesn't look pretty

There's too much surface area, and it's unlikely that Hydra would be able to provide a one-size-fits-all solution

Even if the user is responsible for supplying the download function?

Jasha10 commented 1 year ago

I see.

Even if the user is responsible for supplying the download function?

I suppose that could work. I worry that it would not be trivial to implement this, however. What if there are multiple config files? I think that the user-supplied function would potentially need to download a whole directory-tree of configs. Locally, Hydra looks for config files in config_dir. The call @hydra.main(config_file="foo", config_dir="bar") spells "look for config files in bar, take bar/foo.yaml as the top-level config, and then pull in other configs from bar per foo's defaults list".

Jasha10 commented 1 year ago

I've hacked together a script that uses argparse to strip a --remote-config-location flag from sys.argv and then passes the rest of sys.argv to @hydra.main:

import argparse
import sys
from pathlib import Path
from typing import Optional

import hydra
from omegaconf import DictConfig

def download_from_remote_location(
    remote_config_location: str, save_to: Path | str
) -> None:
    save_to = Path(save_to)
    downloaded_config = f"key: value_from_{remote_config_location}"  # dummy yaml file
    save_to.write_text(downloaded_config)  # save to disk locally

def app(cfg: DictConfig) -> None:
    """The main function"""
    print(cfg)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--remote-config-location", type=str, default=None)
    parser.add_argument("hydra_args", type=str, nargs="*")
    args, unknownargs = parser.parse_known_args()

    remote_config_location: Optional[str] = args.remote_config_location

    if remote_config_location is None:
        # run the app as usual
        decorated_app = hydra.main(
            version_base=None, config_path="default_configs", config_name="default"
        )(app)
        decorated_app()
    else:
        # run the app with a downloaded config
        download_from_remote_location(
            remote_config_location, save_to="remote_config.yaml"
        )
        decorated_app = hydra.main(
            version_base=None, config_path=".", config_name="remote_config.yaml"
        )(app)

        sys.argv = (
            [sys.argv[0]] + unknownargs + args.hydra_args
        )  # argparse strips sys.argv[0], so add it back in.
        decorated_app()
$ python tmp.py +xyz=123
{'key': 'contents of default config', 'xyz': 123}
$ python tmp.py --remote-config-location=s3://... +xyz=123
{'key': 'value_from_s3://...', 'xyz': 123}

Tab completion still works pretty well (as long as you don't pass the --remote-config-location flag).

addisonklinke commented 1 year ago

This is very helpful to have for a complete reference, thanks!

Do you see any downsides to going without ArgumentParser and simply iterating through sys.argv looking for any overrides containing s3://? Something like this

S3_URI = 's3://'

def convert_remote_uris():
    for i, arg in enumerate(sys.argv):
        if S3_URI not in arg:
            continue
        artifact_uri = arg.split(S3_URI)[-1]
        local_path = download_artifact(artifact_uri)
        args[i] = re.sub(f'{S3_URI}.*', local_path, arg)

def download_from_remote_location(
    remote_config_location: str, save_to: Path | str
) -> None:
    save_to = Path(save_to)
    downloaded_config = f"key: value_from_{remote_config_location}"  # dummy yaml file
    save_to.write_text(downloaded_config)  # save to disk locally
    return save_to

@hydra.main(version_base=None, config_path="default_configs", config_name="default")
def app(cfg: DictConfig) -> None:
    """The main function"""
    print(cfg)

if __name__ == "__main__":
    convert_remote_uris()
    app()

EDIT: Since the parameters to @hydra.main() can also be handled as CLI flags like --config_path, I was thinking they'd be covered by convert_remote_uris() and take precedence over the decorator parameters

Jasha10 commented 1 year ago

Do you see any downsides to going without ArgumentParser and simply iterating through sys.argv looking for any overrides containing s3://?

Nope, your solution looks good to me!

addisonklinke commented 1 year ago

The other thing I thought of is using a custom resolver to perform the downloads. However, this would only work for actual config nodes and not overrides for Hydra main.

Do you prefer to close or leave open given the workaround?

Jasha10 commented 1 year ago

Sorry for the slow response.

... using a custom resolver ... would only work for actual config nodes and not overrides for Hydra main.

Yes indeed.

After thinking about this issue some more, I realized that Hydra's ConfigSourcePlugin feature might be perfect for this use-case. It is not a well-documented feature (it's only mentioned once in the docs here). I believe that Hydra's ConfigSourcePlugin API is designed to solve exactly this problem of finding config files that come from unusual sources.

Let's leave the issue open until we're able to confirm whether ConfigSourcePlugin does indeed cover this use-case.

addisonklinke commented 1 year ago

Ah interesting! Yes ConfigSource.load_config() sounds just like the @hydra.main(load_cfg_func=...) approach I was requesting. Is there an complete example anywhere that has all the abstract method implementations for an S3 ConfigSource plugin? I could readily adapt that to MLflow. I think I can see how the derived class should work, but I'm unsure how to declare the active plugins in my main script

Jasha10 commented 1 year ago

There's an example at examples/plugins/example_configsource_plugin.

Jasha10 commented 1 year ago

In addition to the example from my previous comment, see also these three subclasses of ConfigSource defined in hydra's source code:

jihunchoi commented 1 year ago

You can put these two files into hydra_plugins directory for hydra to recognize a path starting with s3://.

https://gist.github.com/jihunchoi/d85ac30ba9f8ab20bf076d84507fc592

Note that it's not the cleanest way of integrating S3 into hydra, since in my understanding currently a path with a custom scheme other than pkg:// is not recognizable and treated as a relative path: https://github.com/facebookresearch/hydra/blob/53d07f56a272485cc81596d23aad33e18e007091/hydra/_internal/utils.py#L118-L135

This means that all URIs in the form of s3://a/b/c will be recognized as a relative file path under PWD, and then converted into the absolute path PWD/s3:/a/b/c (double slashes // is replaced with a single slash /, due to os.path.normpath); thus the custom config source plugin (s3_config_source.py) will not automatically recognize the S3 path.

To mitigate this I made a simple but dirty trick of making a custom search path plugin (s3_search_path.py), to convert if there is a search path containing s3:/, recover the original path starting with s3://. By doing this the S3 path given can be passed correctly to the config source plugin defined in s3_config_source.py. This will normally not make a problem, since : is not allowed in filenames or directory names in linux file system. However if modification is made to https://github.com/facebookresearch/hydra/blob/53d07f56a272485cc81596d23aad33e18e007091/hydra/_internal/utils.py so that it just returns config_path itself when it is in the form of PROTOCOL://..., I guess the solution will be much stable and straightforward.

omry commented 1 year ago

Let's leave the issue open until we're able to confirm whether ConfigSourcePlugin does indeed cover this use-case.

Yes, ConfigSourcePlugin function is exactly to enable loading configs from additional backends.