ESMValGroup / ESMValTool

ESMValTool: A community diagnostic and performance metrics tool for routine evaluation of Earth system models in CMIP
https://www.esmvaltool.org
Apache License 2.0

Dataset pattern matching #3262

Open rebeccaherman1 opened 1 year ago

rebeccaherman1 commented 1 year ago

The '*' syntax for including all climate models or all runs when defining the dataset in ESMValTool is wonderful, but occasionally I am looking for combinations of subsets of available simulations. For instance, if I wanted the historical and piControl runs from CESM2 and GISS-E2-1-G, it seems the only option is to write 4 repetitive lines in the dataset section. I read that the yaml file supports globbing, but it doesn't seem to work in the dataset section. It would be wonderful to be able to search for such a subset of simulations in one line using pattern matching, for instance:

datasets:
  - {dataset: '{CESM2,GISS-E2-1-G}', institute: '*', project: CMIP6, exp: '{historical,piControl}', mip: Amon}

bouweandela commented 12 months ago

"I read that the yaml file supports globbing, but it doesn't seem to work in the dataset section."

Could you share the recipe you're trying to run and the expected result?

rebeccaherman1 commented 12 months ago

The rest of the recipe is perhaps not so relevant? But above is an idea of the dataset line, and I'm hoping it will be equivalent to:

datasets: 
  - {dataset: CESM2, institute: '*', project: CMIP6, exp: historical, mip: Amon}
  - {dataset: CESM2, institute: '*', project: CMIP6, exp: piControl, mip: Amon}
  - {dataset: GISS-E2-1-G, institute: '*', project: CMIP6, exp: historical, mip: Amon}
  - {dataset: GISS-E2-1-G, institute: '*', project: CMIP6, exp: piControl, mip: Amon}

bouweandela commented 12 months ago

The example above is not a valid (part of a) recipe, so with just this information, it is not easy to understand what you're trying to achieve.

rebeccaherman1 commented 12 months ago

Sorry, I'm new to ESMValTool and was a little distracted. I've edited the longer dataset section above; I hope it's now closer to correct.

bouweandela commented 12 months ago

Thanks, it now looks like a correct datasets section, but without the variables section, I still cannot really tell what you're trying to do. You mention that "the yaml file supports globbing, but it doesn't seem to work in the dataset section". Are you referring to the institute: '*' facet used in the datasets above when you say that it doesn't work?

rebeccaherman1 commented 12 months ago

No -- institute: '*' works well :) But sometimes I don't want all available institutes or datasets. Say I want all available institutes ('*') where the dataset matches CESM2 or GISS-E2-1-G. I would like to be able to use globbing (such as dataset: '{CESM2,GISS-E2-1-G}') to refer to this specific subset of the available data without having to write two mostly-redundant lines of code. In my first post, I try to give an example where I use globbing for dataset and for exp, and thus could write one line of code instead of 4. If I am doing this with multiple facets, the number of saved lines of code grows quite rapidly. This becomes even more important if not every combination of dataset, exp, and other facets I hope to limit is available -- I don't want to have to check that before writing out 20 lines of dataset code with every individual combination that actually exists.
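
To make the proposed syntax concrete, here is a minimal Python sketch of the expansion being asked for. It is not ESMValTool code: the function name expand_facets is hypothetical, it only handles a facet value that consists of a single '{a,b}' group, and it does not check whether each combination is actually available.

```python
import itertools
import re

# Matches a value made of a single brace group, e.g. '{a,b}'.
_BRACE_GROUP = re.compile(r"^\{([^{}]*)\}$")


def expand_facets(entry):
    """Expand '{a,b}' facet values into one dict per combination (hypothetical helper)."""
    keys = list(entry)
    options = []
    for key in keys:
        value = entry[key]
        match = _BRACE_GROUP.match(value) if isinstance(value, str) else None
        # A brace group contributes several alternatives, any other value just one.
        options.append(match.group(1).split(",") if match else [value])
    return [dict(zip(keys, combo)) for combo in itertools.product(*options)]


proposed = {
    "dataset": "{CESM2,GISS-E2-1-G}",
    "institute": "*",
    "project": "CMIP6",
    "exp": "{historical,piControl}",
    "mip": "Amon",
}
expanded = expand_facets(proposed)
for facets in expanded:
    print(facets)
```

With two alternatives for dataset and two for exp, the single proposed line expands to the four entries written out earlier in the thread.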

bouweandela commented 12 months ago

Thanks for explaining. I think the confusion comes from calling the new recipe syntax you propose 'globbing'. Most people mean pattern matching based on the syntax described here when they say 'globbing'. This is what is currently supported and should (hopefully) work.

"I don't want to have to check that before writing out 20 lines of dataset code with every individual combination that actually exists."

There is no need to find and write out the datasets 'by hand'. You could have a look at this example notebook where it is done automatically: composing-recipes.ipynb. Similar to how it is done in this notebook, you can write a small script that finds all available datasets, removes unwanted datasets from the list, and finally writes the wanted datasets in recipe format.
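
The filter-and-write steps of such a script could look roughly like the sketch below. It uses only the Python standard library and does not reproduce the notebook: the list called available is a hypothetical stand-in for the result of a real search, and to_recipe_line is an illustrative helper, not an ESMValCore function.

```python
# Hypothetical search result. In practice this list would come from a search
# of the data that is actually available, as the notebook does.
available = [
    {"dataset": "CESM2", "project": "CMIP6", "exp": "historical", "mip": "Amon"},
    {"dataset": "CESM2", "project": "CMIP6", "exp": "piControl", "mip": "Amon"},
    {"dataset": "GISS-E2-1-G", "project": "CMIP6", "exp": "historical", "mip": "Amon"},
    {"dataset": "UKESM1-0-LL", "project": "CMIP6", "exp": "historical", "mip": "Amon"},
    {"dataset": "UKESM1-0-LL", "project": "CMIP6", "exp": "ssp585", "mip": "Amon"},
]

wanted_datasets = {"CESM2", "GISS-E2-1-G"}
wanted_exps = {"historical", "piControl"}

# Remove unwanted entries from the list of available datasets.
selected = [
    facets
    for facets in available
    if facets["dataset"] in wanted_datasets and facets["exp"] in wanted_exps
]


def to_recipe_line(facets):
    """Format one facet dict as a flow-style entry of a datasets section."""
    body = ", ".join(f"{key}: {value}" for key, value in facets.items())
    return f"  - {{{body}}}"


# Write the wanted datasets in recipe format.
recipe_section = "\n".join(["datasets:"] + [to_recipe_line(f) for f in selected])
print(recipe_section)
```

Because the selection starts from what exists, combinations that are missing (here the piControl run of GISS-E2-1-G in the made-up list) never end up in the output.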

rebeccaherman1 commented 12 months ago

That is helpful, thank you.

I suppose I was looking at something like this when I was trying to understand globbing, which also includes some multi-character pattern options in curly braces. I see that the description you linked to only includes single character patterns.

I still think accommodating multi-character patterns would be nice, but I did not previously know about that notebook for composing recipes! It seems helpful; I'll have a look.

bouweandela commented 9 months ago

Yes, that looks nice. Would you know a Python library that implements this? Because it does not appear to work with the Python standard library fnmatch module.
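
A quick check with the standard library illustrates this. The model names are only sample values; the point is that fnmatch treats the curly braces and commas as literal characters rather than as a list of alternatives.

```python
from fnmatch import fnmatchcase

names = ["CESM2", "GISS-E2-1-G", "UKESM1-0-LL"]

# The standard wildcards work: '*' matches any run of characters.
star_matches = [name for name in names if fnmatchcase(name, "*ESM*")]

# So do single-character sets.
set_match = fnmatchcase("CESM2", "CESM[12]")

# A brace pattern matches none of the names ...
brace_matches = [name for name in names if fnmatchcase(name, "{CESM2,GISS-E2-1-G}")]

# ... because it only matches the literal string, braces included.
literal_match = fnmatchcase("{CESM2,GISS-E2-1-G}", "{CESM2,GISS-E2-1-G}")

print(star_matches, set_match, brace_matches, literal_match)
```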

rebeccaherman1 commented 5 months ago

You are right -- I don't see any Python module that implements this. I don't know if fnmatch would consider adding such functionality, when it isn't in the UNIX definition. Would be really helpful, though...

Is fnmatch used internally in ESMValTool when searching for files?

bouweandela commented 5 months ago

Yes, you can see it here: https://github.com/ESMValGroup/ESMValCore/blob/1839787d11233553ef8c969371ec5ab8e0520e1c/esmvalcore/dataset.py#L76-L79

If there were a library, then it would be really easy to add support for these extended glob patterns by just replacing fnmatchcase with a function from that library.
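
As an illustration of what such a drop-in replacement might look like, here is a minimal standard-library sketch. It is not taken from ESMValCore or from any existing library: expand_braces and brace_fnmatchcase are hypothetical names, only comma-separated groups are expanded, and escaping of braces is not handled.

```python
import re
from fnmatch import fnmatchcase

# An innermost brace group that contains at least one comma, e.g. '{a,b}'.
_GROUP = re.compile(r"\{([^{}]*,[^{}]*)\}")


def expand_braces(pattern):
    """Return every pattern obtained by expanding '{a,b}' groups."""
    match = _GROUP.search(pattern)
    if match is None:
        return [pattern]
    head, tail = pattern[: match.start()], pattern[match.end():]
    expanded = []
    for alternative in match.group(1).split(","):
        # Recurse so that further (or enclosing) groups are expanded too.
        expanded.extend(expand_braces(head + alternative + tail))
    return expanded


def brace_fnmatchcase(name, pattern):
    """Like fnmatchcase, but with brace alternatives expanded first."""
    return any(fnmatchcase(name, p) for p in expand_braces(pattern))


print(brace_fnmatchcase("CESM2", "{CESM2,GISS-E2-1-G}"))
print(brace_fnmatchcase("UKESM1-0-LL", "{CESM2,GISS-E2-1-G}"))
print(brace_fnmatchcase("GISS-E2-1-H", "GISS-E2-1-{G,H}"))
```

Patterns without braces go through fnmatchcase unchanged, so existing wildcards such as '*' would keep their current behaviour under this approach.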

rebeccaherman1 commented 5 months ago

I think I found a library that could work!

https://facelessuser.github.io/wcmatch/glob/#syntax

bouweandela commented 4 months ago

That looks like a very nice library. The only risk I see is that it seems to be maintained by a single developer. Would you like to try and integrate it into ESMValCore? I think that the only code you would need to adapt is in these two functions: https://github.com/ESMValGroup/ESMValCore/blob/1839787d11233553ef8c969371ec5ab8e0520e1c/esmvalcore/dataset.py#L70-L79