ESMValGroup / ESMValCore

ESMValCore: A community tool for pre-processing data from Earth system models in CMIP and running analysis scripts.
https://www.esmvaltool.org
Apache License 2.0
42 stars 38 forks source link

esmvalcore.local._path2facets error when drs path contains elements which are not keys #2502

Open k-a-webb opened 1 month ago

k-a-webb commented 1 month ago

When drs is a path with both configurable (e.g., {user}) and not configurable (e.g., nc_output) keys,  _path2facets indexing of keys and paths fails.

Example:

drs = "{user}/{subdir}/{runid}/data/nc_output/CMIP6/CCCma/CCCma/{dataset}-{runid}/{exp}/{ensemble}/{mip}/{short_name}/{grid}/{version}"
path = Path( "/space/hall5/sitestore/eccc/crd/ccrn/users/rvs001/canesm_runs/sv-canam-001/data/nc_output/CMIP6/CCCma/CCCma/CanESM5-1-sv-canam-001/amip/r1i1p1f1/Amon/tas/gn/v20190429/tas_Amon_CanESM5-1-sv-canam-001_amip_r1i1p1f1_gn_200301-200812.nc" )

_path2facets( path, drs ) 

Gives the following error:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[40], [line 24](vscode-notebook-cell:?execution_count=40&line=24)
     [21](vscode-notebook-cell:?execution_count=40&line=21)                 facet1, facet2 = key.split("}-{")
     [22](vscode-notebook-cell:?execution_count=40&line=22)                 facets[facet2] = values[idx].replace(f'{facets[facet1]}-', '')
---> [24](vscode-notebook-cell:?execution_count=40&line=24) _path2facets( path, drs )                

Cell In[40], [line 22](vscode-notebook-cell:?execution_count=40&line=22)
     [20](vscode-notebook-cell:?execution_count=40&line=20) if key not in facets:
     [21](vscode-notebook-cell:?execution_count=40&line=21)     facet1, facet2 = key.split("}-{")
---> [22](vscode-notebook-cell:?execution_count=40&line=22)     facets[facet2] = values[idx].replace(f'{facets[facet1]}-', '')

KeyError: 'dataset'

It incorrectly pairs keys to facets:

> facets {'user': 'CMIP6', 'subdir': 'CCCma', 'runid': 'CCCma', 'exp': 'amip', 'ensemble': 'r1i1p1f1', 'mip': 'Amon', 'short_name': 'tas', 'grid': 'gn', 'version': 'v20190429'}

and then fails when the indexing does not match.

A solution is to record the index of path.split:


def _path2facets(path: Path, drs: str) -> dict[str, str]:
    """Extract facets from a path using a DRS like '{facet1}/{facet2}'."""

    keys = []
    for key in re.findall(r'{(.*?)}[^-]', f'{drs} '):
        key = key.split('.')[0]  # Remove trailing .lower and .upper
        keys.append(key)

    key_indices = {}
    for key in keys:
        for i,part in enumerate( drs.split('/')[::-1] ): # iterate backwards
            if '{{{}}}'.format(key) == part:
                key_indices[key] = -(i+1)

    dirpath = path.parents[0] # get directory path

    facets = {}
    for key in key_indices:
        index = key_indices[key]
        if '{' not in dirpath.parts[index]: 
            facets[key] = dirpath.parts[index] # while multiple instances of key could appear, only need it once

    if len(facets) != len(keys):
        # Extract hyphen separated facet: {facet1}-{facet2},
        # where facet1 is already known.
        for idx, key in enumerate(keys):
            if key not in facets:
                facet1, facet2 = key.split("}-{")
                facets[facet2] = values[idx].replace(f'{facets[facet1]}-', '')

    return facets

 

k-a-webb commented 1 month ago

Noting the my proposed solution fails when keys are intermixed with text... e.g., Tier{tier}:

This works though:

def _path2facets(path: Path, drs: str) -> dict[str, str]:
    """Extract facets from a path using a DRS like '{facet1}/{facet2}'."""

    keys = []
    for key in re.findall(r'{(.*?)}[^-]', f'{drs} '):
        key = key.split('.')[0]  # Remove trailing .lower and .upper
        keys.append(key)

    key_indices = {}
    for key in keys:
        for i,part in enumerate( drs.split('/')[::-1] ): 
            if "}-{" in part and "}-{" not in key:  continue
            if '{{{}}}'.format(key) in part:
                key_indices[key] = -(i+1)

    # check that all keys found
    assert all([ key in key_indices for key in keys ]), "Error: Missing keys"

    dirpath = path.parents[0]

    facets = {}
    for key in key_indices:
        index = key_indices[key]
        if '{' not in dirpath.parts[index]: # deal with multi-key parts below
            facets[key] = dirpath.parts[index] 

    if len(facets) != len(keys):
        # Extract hyphen separated facet: {facet1}-{facet2},
        # where facet1 is already known.
        for idx, key in enumerate(keys):
            if key not in facets:
                facet1, facet2 = key.split("}-{")
                facets[facet2] = values[idx].replace(f'{facets[facet1]}-', '')

    return facets
valeriupredoi commented 1 month ago

hi @k-a-webb and many thanks for raising this! Correct me if I'm wrong, but you are trying to map

{user}/{subdir}/{runid}/data

to a path like

/space/hall5/sitestore/eccc/crd/ccrn/users/rvs001/canesm_runs/sv-canam-001/data

but that'll never work since {user} can not be /space/hall5/sitestore/eccc/crd/ccrn/users/rvs001 - the number of placeholders needs to match the number of elements delimited by the path delimiter /

k-a-webb commented 1 month ago

Hi!

the number of placeholders needs to match the number of elements delimited by the path delimiter /

I can see how this is necessary for _path2facets after investigating the code, but I did not find instructions in the documentation that it was a requirement when setting up config-developer.yml -- although I might missed it!

The code I suggested does not require every element in the path to be a placeholder. This is helpful when there is a set organization to your data (which in this case includes many fixed subdirectory names), but you want to limit the number of parameters required to define a dataset.

bouweandela commented 1 month ago

@k-a-webb You're welcome to make a pull request to improve this, it looks like you already have some good ideas.

You could probably set rootpath to /space/hall5/sitestore/eccc/crd/ccrn/users to avoid the issue mentioned in https://github.com/ESMValGroup/ESMValCore/issues/2502#issuecomment-2269379129.