marc-white commented 1 month ago

Describe the bug

The current Builders/Translators seem to be unable to cope if the metadata.yaml for a new experiment is lacking a value for model. However, model is not defined to be a required element of the metadata.yaml files (see the docs).

To Reproduce

Attempt to do a test catalog build of the MOM6 data using the current HEAD of branch 175-data-request etc. The following is reported out of the PBS job handler:

Successfully wrote ESM catalog json file to: file:///scratch/tm70/mcw120/intake-cat-test/v0.1.4/source/OM4_025.JRA_RYF.json
Successfully wrote ESM catalog json file to: file:///scratch/tm70/mcw120/intake-cat-test/v0.1.4/source/panant-01-hycom1-v13.json
Traceback (most recent call last):
  File "/g/data/xp65/public/apps/med_conda/envs/access-med-0.6/bin/catalog-build", line 10, in <module>
    sys.exit(build())
  File "/home/120/mcw120/access-nri/access-nri-intake-catalog/src/access_nri_intake/cli.py", line 194, in build
    getattr(cm, method)(**args)
  File "/home/120/mcw120/access-nri/access-nri-intake-catalog/src/access_nri_intake/catalog/manager.py", line 127, in build_esm
    self._add()
  File "/home/120/mcw120/access-nri/access-nri-intake-catalog/src/access_nri_intake/catalog/manager.py", line 204, in _add
    self.dfcat.add(self.source, row.to_dict(), overwrite=overwrite)
  File "/g/data/xp65/public/apps/med_conda/envs/access-med-0.6/lib/python3.10/site-packages/intake_dataframe_catalog/core.py", line 296, in add
    raise DfFileCatalogError(
intake_dataframe_catalog.core.DfFileCatalogError: Cannot add entry with iterable metadata columns: ['realm', 'frequency', 'variable'] to dataframe catalog with iterable metadata columns: ['model', 'realm', 'frequency', 'variable'].  Please ensure that metadata entries are consistent.

Given the metadata.yaml for these experiments had no model set, none was added to the constituent Intake-ESM catalogues. My attempts using the Mom6Translator to hard-code in a model value for these experiments is causing an inconsistency.

Additional context

Found during test builds for #175 .

charles-turner-1 commented 1 month ago

I've got a feeling that model might not be a required field since we have ERA Reanalysis datasets in the catalog - I can see how it might confusing/misleading to label a reanalysis with a model?

I'm currently trying to track down the Era Interim metadata - will update with my findings from that.

marc-white commented 1 month ago

I can see in the Builders how, for example, the realm information is captured and updated from various sources, but I haven't been able to track down where model gets pulled in.

dougiesquire commented 1 month ago

See https://github.com/ACCESS-NRI/access-nri-intake-catalog/blob/017cd2d02978accec3bef89797b1d659c7cc2597/src/access_nri_intake/catalog/translators.py#L81-L85

marc-white commented 1 month ago

@dougiesquire so what happens if an Intake-ESM data store has been built that doesn't have a model column? That's what appears to have happened with the two Intake-ESM data stores that the system has built for #175 , and my working theory is that that has happened because the metadata.yaml that was used to create the stores lacks a model field.

dougiesquire commented 1 month ago

It should fail when trying to add the datastore to a catalog if model cannot be retrieved from the metadata.yaml (I think - it's been quite a while since I did any of this)?

marc-white commented 1 month ago

That is what happens, but it fails at the Translator stage due to the missing model information in the datastore, rather than the missing model in the YAML directly (I think - I need to force-break it a few more times to fully track it down). Hence my suggestion to force model to be required in the metadata.yaml, so when the system has to build the Intake-ESM catalogue itself, it does so with a model column.

dougiesquire commented 1 month ago

The contents of the metadata.yaml is stored in the metadata attribute of the Intake-ESM datastore. My guess is that the failure that you're seeing is during the final step of the process described in the link above:

- If the input source is an intake-esm datastore, the translator will first look for the column in the
  esmcat.df attribute, casting iterable columns to tuples. If the source is not an intake-esm datastore,
  this step is skipped.
- If that fails, the translator will then look for the column name as an attribute on the source itself
- If that fails, the translator will then look for the column name in the metadata attribute of the source

To test this, try adding a model field to the metadata.yaml and see if things work.

It's deliberate that model is only an optional field in the metadata.yaml. Consider the externally-created CMIP7 datastores. These contain many models and will change through time, so it's better to pull the models directly from the relevant column in the datastore rather than require someone write (and keep synchronized) the models into the metadata.yaml.

Very happy for alternative approaches but that's how things are set up at the moment (IIRC).

marc-white commented 1 month ago

Going back to the DefaultTranslator with my attempted MOM6 catalogues (#175 ), and adding the model field into the experiment metadata.yaml, does get things working (although I'm still not clear on where the model information gets carried through and kept properly; I'll take another look later).

However, I think this is still an issue - there's no flagging saying that you need to supply a model field to the metadata.yaml if your data doesn't contain a separate model column. I can think of a few (non-exclusive) ways to address this:

Have metadata-validate throw a Warning if it encounters a metadata.yaml without a model field, e.g., WARNING: Your metadata.yaml has no model field. This means we are expecting the input data to have a model attribute.

Add the ability for the Builder or Translator to supply a default value for model if one is not found. Right now, the error that comes out is something confusing about the broadcasting of arrays/columns:

Traceback (most recent call last):
File "/g/data/xp65/public/apps/med_conda/envs/access-med-0.6/bin/catalog-build", line 10, in <module>
sys.exit(build())
File "/home/120/mcw120/access-nri/access-nri-intake-catalog/src/access_nri_intake/cli.py", line 194, in build
getattr(cm, method)(**args)
File "/home/120/mcw120/access-nri/access-nri-intake-catalog/src/access_nri_intake/catalog/manager.py", line 128, in build_esm
self._add()
File "/home/120/mcw120/access-nri/access-nri-intake-catalog/src/access_nri_intake/catalog/manager.py", line 205, in _add
self.dfcat.add(self.source, row.to_dict(), overwrite=overwrite)
File "/g/data/xp65/public/apps/med_conda/envs/access-med-0.6/lib/python3.10/site-packages/intake_dataframe_catalog/core.py", line 296, in add
raise DfFileCatalogError(
intake_dataframe_catalog.core.DfFileCatalogError: Cannot add entry with iterable metadata columns: ['realm', 'frequency', 'variable'] to dataframe catalog with iterable metadata columns: ['model', 'realm', 'frequency', 'variable'].  Please ensure that metadata entries are consistent.

marc-white commented 1 month ago

(although I'm still not clear on where the model information gets carried through and kept properly; I'll take another look later).

To clarify, it looks like the model info (plus all the other metadata) gets put into the master metacatalog.csv, but only if the full build process completes successfully.

charles-turner-1 commented 1 week ago

I've spent some time looking into this this afternoon. I think that in the case of

Translators: I'm not convinced this case is worth worrying about too much. Most of the translators in translators.py have some logic specific to model/realm etc, so should be easy enough to work out that some additional logic is required to get things to hang together. Potentially we could add a warning to DefaultTranslator.ModelTranslator to point anyone who does get this error in the right direction?
Builders: The error you're getting is being raised by intake-dataframe-catalog & isn't really specific to models (or anything access-nri-intake-catalog related really, in some sense). I think we can
- Improve the error message in intake-dataframe-catalog by specifying the missing iterable metadata column.
- In access_nri_intake_catalog.catalog.manager.CatalogManager._add, we catch DfFileCatalogErrors, pattern match on this & add some extra relevant info.
I'm not sure we want to add a default model - it just kind of feels wrong/potentially misleading to me?

marc-white commented 1 week ago

Confirming what we discussed in today's meeting:

@marc-white to update documentation to explain the intracacies of the model keyword value;
@charles-turner-1 to improve error catching/handling between access-nri-intake-catalog and intake-dataframe-catalog

charles-turner-1 commented 1 week ago

Been finding it surprisingly difficult to reproduce the bug in a way that lets me write useful tests - @marc-white are you able to confirm whether it was this hash where you were having the issue?

marc-white commented 1 week ago

Been finding it surprisingly difficult to reproduce the bug in a way that lets me write useful tests - @marc-white are you able to confirm whether it was this hash where you were having the issue?

I think that was it? It's been a while...

charles-turner-1 commented 1 week ago

Cool, cheers - will start there and see where I end up

ACCESS-NRI / access-nri-intake-catalog

[BUG] Make `model` required for `metadata.yaml` files #223

Describe the bug

To Reproduce

Additional context