[Feature Request] I'd like to be able to do multirun over files matching a glob (ideally a glob that depends on other, non-swept params in the config file) #2942
🚀 Feature Request
I want to be able to specify a multirun job to sweep over a list of files matching a glob.
Motivation
I use Hydra for a number of parallel data processing pipelines, and I use multirun jobs to parallelize my commands over different data shards. In these cases I have had to write custom bash helpers that take a file glob and turn it into a comma-separated list of file paths in Hydra's override syntax so that the sweeper recognizes it as a valid sweep. I can't use a custom resolver for this, because resolvers are evaluated after the sweeper has already begun expanding the parameters.
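For context, a resolver-based attempt might look like the sketch below (the resolver name and lambda are hypothetical, not part of Hydra). It illustrates why this path doesn't help: the resolver only expands the glob inside a single job's already-composed config, whereas the sweeper needs the list of values before any job config is composed.

```python
from glob import glob

from omegaconf import OmegaConf

# Hypothetical resolver that expands a glob pattern into a sorted list of paths.
# It resolves lazily, inside one composed config, so it cannot produce one
# sweep value per matching file for the --multirun sweeper.
OmegaConf.register_new_resolver("file_glob", lambda pattern: sorted(glob(pattern)))
```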
Pitch
Describe the solution you'd like
Much like there is a `range(0,N)` helper for sweeping over integer values, I would like a `glob(file_glob)` option that I can put in a config or on the command line to sweep over all files matching that glob.
Describe alternatives you've considered
We currently have this helper implemented as a Python script, which we package and release via PyPI and then call from a bash script to produce the input to the Hydra program. This results in a syntax like `my_hydra_app --multirun data.root=$DATA_DIR data.shard=$(expand_shards $DATA_DIR)`, whereas I would like the ability to do `my_hydra_app --multirun data.root=$DATA_DIR data.shard=glob($DATA_DIR)`. Critically, I would also like to be able to put `data.shard=glob($data.root)` into my Hydra config, if possible, so that this can be configured in the `.yaml` file rather than on the command line.
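For illustration, a minimal sketch of what such an `expand_shards` helper might look like (the actual packaged script may differ, and the one-file-per-shard layout assumed below is hypothetical):

```python
#!/usr/bin/env python
"""Minimal sketch of an expand_shards-style helper (illustrative only).

Expands a glob rooted at the given directory into the comma-separated
choice list that Hydra's sweeper accepts on the command line.
"""
import sys
from glob import glob


def main() -> None:
    root = sys.argv[1]
    # Assumed layout: one shard per file directly under the data root.
    shards = sorted(glob(f"{root}/*"))
    print(",".join(shards))


if __name__ == "__main__":
    main()
```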
Are you willing to open a pull request? (See CONTRIBUTING)
Yes
Additional context
Add any other context or screenshots about the feature request here.