facebookresearch / hydra

Hydra is a framework for elegantly configuring complex applications
https://hydra.cc
MIT License
8.84k stars 638 forks

[Feature Request] I'd like to be able to do multirun over files matching a glob (ideally a glob that depends on other, non-sweeped params in the config file) #2942

Open mmcdermott opened 2 months ago

mmcdermott commented 2 months ago

🚀 Feature Request

I want to be able to specify a multirun job to sweep over a list of files matching a glob.

Motivation

I use Hydra for a number of parallel data processing pipelines, and use multirun jobs to parallelize my commands across different data shards. In these cases, I have had to write custom bash helpers that take a file glob and turn it into a list of file paths in Hydra's sweep syntax, so that the sweeper recognizes it as a valid option. I can't use a custom resolver, because resolvers are only resolved after the sweeper has already begun sweeping through the parameters.

Pitch

Describe the solution you'd like

Much like there is a range(0,N) helper for sweeping over integer values, I would like a glob(file_glob) option that I can put in a config or on the command line to sweep over all files matching that glob.

Describe alternatives you've considered

We currently have this helper implemented as a Python script, which we package and release via PyPI and then call from a bash script to produce the input to the Hydra program. This results in a syntax like

my_hydra_app --multirun data.root=$DATA_DIR data.shard=$(expand_shards $DATA_DIR)

whereas I would like to be able to write

my_hydra_app --multirun data.root=$DATA_DIR data.shard=glob($DATA_DIR)

Critically, I would also like to be able to put data.shard=glob($data.root) into my Hydra config, if possible, so that this can be configured in the .yaml file rather than on the command line.
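For readers unfamiliar with this kind of workaround, here is a rough Python sketch of what such a helper might look like. It is not the packaged helper itself. The name expand_shards is borrowed from the command in this issue, but the signature, the pattern argument, and the choice to emit paths relative to the directory are all assumptions made for illustration. It relies only on the fact that Hydra's multirun accepts a comma-separated list of values for a key.

```python
import glob
import os


def expand_shards(data_dir: str, pattern: str = "*") -> str:
    """Hypothetical helper: list files in data_dir matching pattern
    and join them into a comma-separated string for a multirun sweep."""
    # Sort so the sweep order is stable from one invocation to the next.
    matches = sorted(glob.glob(os.path.join(data_dir, pattern)))
    # Keep regular files only, expressed relative to data_dir
    # (an assumption; absolute paths would work equally well).
    shards = [os.path.relpath(p, data_dir) for p in matches if os.path.isfile(p)]
    return ",".join(shards)
```

A small script that prints the returned string can be substituted into the command line with $(...), which is the pattern the request aims to replace. File names containing commas, spaces, or other characters that are special in Hydra's override grammar would need quoting, which this sketch does not attempt.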

Are you willing to open a pull request? (See CONTRIBUTING)

Yes
