Open marc-white opened 1 month ago
I've got a feeling that model might not be a required field since we have ERA Reanalysis datasets in the catalog - I can see how it might be confusing/misleading to label a reanalysis with a model?
I'm currently trying to track down the ERA-Interim metadata - will update with my findings from that.
I can see in the Builders how, for example, the realm information is captured and updated from various sources, but I haven't been able to track down where model gets pulled in.
@dougiesquire so what happens if an Intake-ESM data store has been built that doesn't have a model column? That's what appears to have happened with the two Intake-ESM data stores that the system has built for #175, and my working theory is that this happened because the metadata.yaml that was used to create the stores lacks a model field.
It should fail when trying to add the datastore to a catalog if model cannot be retrieved from the metadata.yaml (I think - it's been quite a while since I did any of this)?
That is what happens, but it fails at the Translator stage due to the missing model information in the datastore, rather than the missing model in the YAML directly (I think - I need to force-break it a few more times to fully track it down). Hence my suggestion to force model to be required in the metadata.yaml, so when the system has to build the Intake-ESM catalogue itself, it does so with a model column.
The contents of the metadata.yaml are stored in the metadata attribute of the Intake-ESM datastore. My guess is that the failure that you're seeing is during the final step of the process described in the link above:
- If the input source is an intake-esm datastore, the translator will first look for the column in the
esmcat.df attribute, casting iterable columns to tuples. If the source is not an intake-esm datastore,
this step is skipped.
- If that fails, the translator will then look for the column name as an attribute on the source itself
- If that fails, the translator will then look for the column name in the metadata attribute of the source
To test this, try adding a model field to the metadata.yaml and see if things work.
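For anyone following along, the lookup order listed above can be sketched roughly like this. It is only an illustration of the three-step fallback, not the actual translator code: `find_column`, `ColumnNotFoundError` and the `SimpleNamespace` stand-in for a datastore are all hypothetical, and a plain dict stands in for the `esmcat.df` table.

```python
from types import SimpleNamespace


class ColumnNotFoundError(Exception):
    """Hypothetical error raised when none of the lookup steps finds the column."""


def find_column(source, column):
    """Look up `column` using the three-step fallback described in the thread."""
    # Step 1: a column of the datastore table (skipped if the source has no esmcat).
    esmcat = getattr(source, "esmcat", None)
    if esmcat is not None and column in esmcat.df:
        # Cast iterable entries to tuples so that they are hashable.
        return [tuple(v) if isinstance(v, (list, set)) else v for v in esmcat.df[column]]

    # Step 2: an attribute on the source itself.
    if hasattr(source, column):
        return getattr(source, column)

    # Step 3: an entry in the source's metadata (the contents of metadata.yaml).
    metadata = getattr(source, "metadata", None) or {}
    if column in metadata:
        return metadata[column]

    raise ColumnNotFoundError(
        f"Could not find {column!r} in the datastore columns, "
        "the source attributes or the source metadata"
    )


# Stand-in datastore: no model column in the table, but a model entry in the metadata.
store = SimpleNamespace(
    esmcat=SimpleNamespace(df={"realm": [["ocean"], ["ocean", "seaIce"]]}),
    metadata={"model": ["my-model"]},  # placeholder value
)

print(find_column(store, "realm"))  # [('ocean',), ('ocean', 'seaIce')]
print(find_column(store, "model"))  # ['my-model']
```

With a store like this, removing the `model` entry from the metadata means all three steps fail, which matches the suspicion that the failure happens at the final step.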
It's deliberate that model is only an optional field in the metadata.yaml. Consider the externally-created CMIP7 datastores. These contain many models and will change through time, so it's better to pull the models directly from the relevant column in the datastore rather than require someone to write (and keep synchronized) the models into the metadata.yaml.
Very happy for alternative approaches but that's how things are set up at the moment (IIRC).
Going back to the DefaultTranslator with my attempted MOM6 catalogues (#175), and adding the model field into the experiment metadata.yaml, does get things working (although I'm still not clear on where the model information gets carried through and kept properly; I'll take another look later).
However, I think this is still an issue - there's no flagging saying that you need to supply a model field to the metadata.yaml if your data doesn't contain a separate model column. I can think of a few (non-exclusive) ways to address this:
- Have metadata-validate throw a Warning if it encounters a metadata.yaml without a model field, e.g., WARNING: Your metadata.yaml has no model field. This means we are expecting the input data to have a model attribute.
- Get the Builder or Translator to supply a default value for model if one is not found.
Right now, the error that comes out is something confusing about the broadcasting of arrays/columns:
Traceback (most recent call last):
  File "/g/data/xp65/public/apps/med_conda/envs/access-med-0.6/bin/catalog-build", line 10, in <module>
    sys.exit(build())
  File "/home/120/mcw120/access-nri/access-nri-intake-catalog/src/access_nri_intake/cli.py", line 194, in build
    getattr(cm, method)(**args)
  File "/home/120/mcw120/access-nri/access-nri-intake-catalog/src/access_nri_intake/catalog/manager.py", line 128, in build_esm
    self._add()
  File "/home/120/mcw120/access-nri/access-nri-intake-catalog/src/access_nri_intake/catalog/manager.py", line 205, in _add
    self.dfcat.add(self.source, row.to_dict(), overwrite=overwrite)
  File "/g/data/xp65/public/apps/med_conda/envs/access-med-0.6/lib/python3.10/site-packages/intake_dataframe_catalog/core.py", line 296, in add
    raise DfFileCatalogError(
intake_dataframe_catalog.core.DfFileCatalogError: Cannot add entry with iterable metadata columns: ['realm', 'frequency', 'variable'] to dataframe catalog with iterable metadata columns: ['model', 'realm', 'frequency', 'variable']. Please ensure that metadata entries are consistent.
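As a rough idea of what the warning suggested for metadata-validate could look like, here is a minimal sketch. It assumes the metadata.yaml has already been parsed into a dict; `warn_if_no_model` is a hypothetical helper and not part of the existing metadata-validate command.

```python
import warnings

NO_MODEL_MESSAGE = (
    "Your metadata.yaml has no model field. "
    "This means we are expecting the input data to have a model attribute."
)


def warn_if_no_model(metadata):
    """Warn (rather than fail) when the parsed metadata has no usable model entry.

    Returns True if a model entry is present and False otherwise, so a caller
    can decide whether to do anything further.
    """
    if not metadata.get("model"):
        # stacklevel=2 attributes the warning to the caller of this helper.
        warnings.warn(NO_MODEL_MESSAGE, UserWarning, stacklevel=2)
        return False
    return True
```

Keeping this as a warning rather than an error leaves model optional, so datastores that carry their own model column would still validate.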
> (although I'm still not clear on where the model information gets carried through and kept properly; I'll take another look later).
To clarify, it looks like the model info (plus all the other metadata) gets put into the master metacatalog.csv, but only if the full build process completes successfully.
I've spent some time looking into this this afternoon. I think that in the case of DefaultTranslator.ModelTranslator to point anyone who does get this error in the right direction?
Builders: The error you're getting is being raised by intake-dataframe-catalog & isn't really specific to models (or anything access-nri-intake-catalog related really, in some sense). I think we can access_nri_intake.catalog.manager.CatalogManager._add, we catch DfFileCatalogErrors, pattern match on this & add some extra relevant info.
I'm not sure we want to add a default model - it just kind of feels wrong/potentially misleading to me?
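Something along these lines is what I have in mind for catching the error and adding context. This is a sketch only: the `DfFileCatalogError` class below is a local stand-in for `intake_dataframe_catalog.core.DfFileCatalogError` so that the snippet runs on its own, and `CatalogBuildError`, `HINT` and `add_with_hint` are hypothetical names.

```python
class DfFileCatalogError(Exception):
    """Local stand-in for intake_dataframe_catalog.core.DfFileCatalogError."""


class CatalogBuildError(Exception):
    """Hypothetical error type that carries the extra hint."""


HINT = (
    "If 'model' is missing from the entry's columns, check that the datastore "
    "has a model column or that metadata.yaml has a model field."
)


def add_with_hint(add, *args, **kwargs):
    """Call `add`, re-raising column-mismatch errors with an extra hint attached."""
    try:
        return add(*args, **kwargs)
    except DfFileCatalogError as exc:
        # Pattern match on the message; anything else is re-raised unchanged.
        if "iterable metadata columns" in str(exc):
            raise CatalogBuildError(f"{exc}\n{HINT}") from exc
        raise
```

Chaining with `from exc` keeps the original traceback visible, so nothing from intake-dataframe-catalog is hidden, and no default model needs to be invented.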
Confirming what we discussed in today's meeting:
access-nri-intake-catalog and intake-dataframe-catalog
Been finding it surprisingly difficult to reproduce the bug in a way that lets me write useful tests - @marc-white are you able to confirm whether it was this hash where you were having the issue?
I think that was it? It's been a while...
Cool, cheers - will start there and see where I end up
Describe the bug
The current Builders/Translators seem to be unable to cope if the metadata.yaml for a new experiment is lacking a value for model. However, model is not defined to be a required element of the metadata.yaml files (see the docs).
To Reproduce
Attempt to do a test catalog build of the MOM6 data using the current HEAD of branch 175-data-request etc. The following is reported out of the PBS job handler:
Given the metadata.yaml for these experiments had no model set, none was added to the constituent Intake-ESM catalogues. My attempts using the Mom6Translator to hard-code in a model value for these experiments are causing an inconsistency.
Additional context
Found during test builds for #175.