Uppercase project facet in dataset identifier is not recognized while parsing facets

AtefBN commented 6 years ago

The current process doesn't impose any specific casing on the project part of the dataset identifier; however, an uppercase project facet in the dataset id is not recognized when the facets are parsed.

asladeofgreen commented 6 years ago

@AtefBN So can you confirm that mixed case is permitted, e.g. CMIP6 | cmip6? If so, can you point me to the relevant place in the documentation where this is stated?

AtefBN commented 6 years ago

@momipsl My understanding is that, following the most recent change, only the uppercase form CMIP6 is permitted. But @glevava is the one more up to speed on this.

glevava commented 6 years ago

@momipsl I don't understand your question: mixed case in the project value (or mip_era in the CMIP6 context) is allowed. The issue comes from the "dataset_id" formatting, which was never properly defined in the ESGF. Historically, the first dataset_id key (i.e., "project", or "mip_era" for CMIP6) was set to lowercase by default, even if it is uppercase in the netCDF global attribute or in the DRS. CMIP6 starts a new paradigm where all the dataset_id facets have to follow the case (which could be lower, upper or mixed) from the DRS itself (or the CMIP6 specifications in our case). I don't know exactly how this dataset_id checking is handled by pyessv, but it appears that "cmip6.CMIP.IPSL.[...]" triggers a pyessv error whereas "CMIP6.CMIP.IPSL.[...]" does not.

asladeofgreen commented 6 years ago

pyessv assumes that the first element in a dataset identifier is lower case; all other elements are as per the DRS.

As you said the dataset id formatting was never formally defined ... it should all be lower-case without exception. Using mixed / upper case is simply a hack due to the ESG-F & WCRP vocabulary sub-systems not being able to distinguish between a canonical-name, a raw-name, and a (UI) label.

glevava commented 6 years ago

I agree. But ultimately we should be consistent one way or the other:

A. pyessv assumes that ALL facet values are lower case.
B. pyessv assumes that ALL facet values are as per the DRS (with mixed case).

In the context of the errata service, the dataset_id string can be reformatted before calling pyessv in order to follow assumption A or B.
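To make the two options concrete, here is a rough sketch of what such a reformatting step could look like on the errata service side. It is plain Python and does not call pyessv; the function names and the `CANONICAL_PROJECTS` mapping are hypothetical and only illustrate the idea.

```python
# Hypothetical mapping from a lower-cased project key to its DRS casing.
CANONICAL_PROJECTS = {"cmip5": "cmip5", "cmip6": "CMIP6", "cordex": "cordex"}


def reformat_all_lower(dataset_id):
    """Assumption A: every facet value is passed on in lower case."""
    return dataset_id.lower()


def reformat_project_facet(dataset_id, canonical=CANONICAL_PROJECTS):
    """Assumption B: restore the DRS casing of the first facet only.

    The remaining facets are left untouched, on the assumption that they
    already follow the DRS.
    """
    project, sep, rest = dataset_id.partition(".")
    # Unknown projects are passed through unchanged.
    return canonical.get(project.lower(), project) + sep + rest


if __name__ == "__main__":
    print(reformat_all_lower("CMIP6.CMIP.IPSL"))
    print(reformat_project_facet("cmip6.CMIP.IPSL"))
```

Option B only needs to know the expected casing of the first facet, which is why the sketch leaves everything after the first dot alone.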

asladeofgreen commented 6 years ago

FYI the pyessv dataset id parsers are found here:

https://github.com/ES-DOC/pyessv/blob/master/pyessv/_parsers/cmip5_dataset_id.py#L25
https://github.com/ES-DOC/pyessv/blob/master/pyessv/_parsers/cmip6_dataset_id.py#L22
https://github.com/ES-DOC/pyessv/blob/master/pyessv/_parsers/cordex_dataset_id.py#L25

You will observe that the initial element of each template, i.e. project / mip-era, is considered to be a lower-case constant. You will also observe that the so-called parsing strictness is PARSING_STRICTNESS_1, i.e. the elements are validated against the DRS names, e.g. IPSL (note the upper case).
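For readers who don't want to open the parsers, the behaviour described here can be approximated with a small standalone sketch. This is not the pyessv code: the template layout, the `parse` function and the two tiny vocabularies are made up for illustration, and a real template has more elements.

```python
# Hypothetical stand-ins for DRS vocabularies (not loaded from pyessv).
ACTIVITIES = {"CMIP", "ScenarioMIP"}
INSTITUTIONS = {"IPSL", "MOHC"}

# Each slot is either a constant (str) or a set of allowed DRS names.
# The first slot is a lower-case constant, as described above.
TEMPLATE = ("cmip6", ACTIVITIES, INSTITUTIONS)


def parse(identifier, template=TEMPLATE):
    """Split the identifier on '.' and validate each element, case-sensitively."""
    parts = identifier.split(".")
    if len(parts) != len(template):
        raise ValueError("wrong number of elements: %r" % identifier)
    for position, (part, slot) in enumerate(zip(parts, template)):
        if isinstance(slot, str):
            valid = part == slot   # constant element must match exactly
        else:
            valid = part in slot   # vocabulary element must be a known name
        if not valid:
            raise ValueError("element %d is invalid: %r" % (position, part))
    return parts


if __name__ == "__main__":
    print(parse("cmip6.CMIP.IPSL"))      # accepted
    try:
        parse("CMIP6.CMIP.IPSL")         # rejected: constant is lower case
    except ValueError as err:
        print(err)
```

With a lower-case constant in the first slot, an identifier starting with CMIP6 fails at element 0 before any of the DRS names are checked.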

asladeofgreen commented 6 years ago

@glevava The formatting of DRS elements found within a dataset identifier should be clearly defined in a DRS white paper. I have assumed that all facet values are as per the DRS; however, the casing of the initial element, i.e. project/mip-era, is ambiguous. What about cordex, for example, which is not a mip-era?

glevava commented 6 years ago

@momipsl The formatting of all DRS elements is clearly specified by the CMIP6 specifications and WCRP documents. What is not specified is how the CV is used in the ESGF ecosystem to build the "dataset_id" or the "CoG search UI", which are purely ESGF components. The "dataset_id" rules (whatever the project source) are missing in the ESGF, and so discrepancies appear between the projects.

CORDEX dataset_ids have lower case project key.

asladeofgreen commented 6 years ago

Can you please confirm:

CMIP5: initial element is lower case, i.e. cmip5 is valid ?

CMIP6: initial element is mixed-case, i.e. cmip6, CMIP6 or CmiP6 are all valid ?

CORDEX: initial element is lower case, i.e. cordex is valid ?

glevava commented 6 years ago

Regarding the dataset_id format only:

CMIP5: initial element is lower case, i.e. cmip5 is valid ?
no, all CMIP5 data have been published with "cmip5.output[12]...." (-> lowercase only)

CMIP6: initial element is mixed-case, i.e. cmip6, CMIP6 or CmiP6 are all valid ?
"CMIP6" only. (-> uppercase only)

CORDEX: initial element is lower case, i.e. cordex is valid ?
"cordex" only. (-> lowercase only)

Please note that future projects supported by the errata service, such as "obs4MIPs" or "input4MIPs", will have a mixed-case "project/mip_era" facet value.
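The casing rules listed in this comment could be captured in a small lookup, sketched below. The table and the `has_valid_project_facet` helper are hypothetical names for illustration only; the values are the ones given above, and only the first element of the identifier is checked.

```python
# Expected casing of the first dataset_id element, keyed by lower-cased project.
EXPECTED_PROJECT_FACET = {
    "cmip5": "cmip5",            # lowercase only
    "cmip6": "CMIP6",            # uppercase only
    "cordex": "cordex",          # lowercase only
    "obs4mips": "obs4MIPs",      # mixed case (future project)
    "input4mips": "input4MIPs",  # mixed case (future project)
}


def has_valid_project_facet(dataset_id):
    """Return True if the first element uses the casing expected for its project."""
    first = dataset_id.split(".", 1)[0]
    expected = EXPECTED_PROJECT_FACET.get(first.lower())
    return expected is not None and first == expected


if __name__ == "__main__":
    for candidate in ("CMIP6.CMIP.IPSL", "cmip6.CMIP.IPSL", "cmip5.output1.IPSL"):
        print(candidate, has_valid_project_facet(candidate))
```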

asladeofgreen commented 6 years ago

OK I will update the relevant dataset-id parsers to reflect this, actually only the CMIP6 parser has to be updated. I will then push to GitHub and the ES-DOC WebFaction servers so that you guys can test.

I have been thinking about creating a web-service interface to pyessv, perhaps a validate-dataset-id endpoint may be a good starting point.

asladeofgreen commented 6 years ago

Also the esgf-publisher config file for CMIP6 is IMHO incorrect. The dataset_id value should hard-code CMIP6 rather than mip_era:

dataset_id = %(mip_era)s.%(activity_id)s.%(institution_id)s.%(source_id)s.%(experiment_id)s.%(member_id)s.%(table_id)s.%(variable_id)s.%(grid_label)s

dataset_id = CMIP6.%(activity_id)s.%(institution_id)s.%(source_id)s.%(experiment_id)s.%(member_id)s.%(table_id)s.%(variable_id)s.%(grid_label)s
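The difference between the two templates can be illustrated with plain Python `%` formatting, which uses the same `%(name)s` placeholder syntax. This is only a sketch: it does not run the publisher, and the facet values below (including a lower-case `mip_era`) are example inputs chosen to show the effect.

```python
# The two templates from the config, current and proposed.
CURRENT = ("%(mip_era)s.%(activity_id)s.%(institution_id)s.%(source_id)s."
           "%(experiment_id)s.%(member_id)s.%(table_id)s.%(variable_id)s.%(grid_label)s")
PROPOSED = ("CMIP6.%(activity_id)s.%(institution_id)s.%(source_id)s."
            "%(experiment_id)s.%(member_id)s.%(table_id)s.%(variable_id)s.%(grid_label)s")

# Illustrative facet values; mip_era is deliberately lower case here.
facets = {
    "mip_era": "cmip6",
    "activity_id": "CMIP",
    "institution_id": "IPSL",
    "source_id": "IPSL-CM6A-LR",
    "experiment_id": "historical",
    "member_id": "r1i1p1f1",
    "table_id": "Amon",
    "variable_id": "tas",
    "grid_label": "gr",
}

current_id = CURRENT % facets    # first element follows whatever mip_era holds
proposed_id = PROPOSED % facets  # first element is always the literal CMIP6

if __name__ == "__main__":
    print(current_id)
    print(proposed_id)
```

With the hard-coded prefix, the first element no longer depends on how `mip_era` happens to be cased in the input.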

glevava commented 6 years ago

@momipsl You're right I'll modify it.

ES-DOC / esdoc-errata-ws

Uppercase project facet in dataset identifier is not recognized while parsing facets #2