aradhakrishnanGFDL / CatalogBuilder

CatalogBuilder for data discovery and analysis

catalog vocabulary slightly incompatible with example analysis script usage #120

Open ceblanton opened 6 months ago

ceblanton commented 6 months ago

FRE Canopy is generating catalogs using:

module load fre/canopy

fre catalog build --overwrite -i $ppdir -o $ppdir/catalog

sed -i.bak -e 's/,P1M,/,monthly,/' $ppdir/catalog.csv

An example pp directory and catalog file are here:

The example analysis script usage (the Ray example) is:

module load python/3.9

source /net2/rlm/analysis-scripts/example/env/bin/activate

python3 -c "from freanalysis_clouds import CloudAnalysisScript; CloudAnalysisScript().run_analysis('/archive/Chris.Blanton/am5/am5f7b11r0/c96L65_am5f7b11r0_amip/gfdl.ncrc5-deploy-prod-openmp/pp/catalog.json', '/nbhome/$USER/sample-output')"

That fails with this message:

/net2/rlm/analysis-scripts/example/env/lib/python3.9/site-packages/pydantic/deprecated/decorator.py:222: UserWarning: There are no datasets to load! Returning an empty dictionary.

  return self.raw_function(**d, **var_kwargs)

Traceback (most recent call last):

  File "<string>", line 1, in <module>

  File "/net2/rlm/analysis-scripts/example/env/lib/python3.9/site-packages/freanalysis_clouds/__init__.py", line 125, in run_analysis

    datasets[self.metadata.catalog_key(variable)],

KeyError: 'c96L65_am5f4b4r1-newrad_amip.monthly.na.atmos.high_cld_amt'

The mystery is that this very-similar catalog works:

/net2/rlm/analysis-scripts/example/catalog.json

We think the difference is "n/a" versus a missing value for the ensemble vocabulary.

Hopefully, the "fre catalog validate /path/to/schema.json /path/to/catalog-to-test.json" usage can detect this mismatch or inconsistency before we try to launch the script.
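
As an illustration of the kind of pre-launch check described above, here is a minimal sketch that scans a catalog CSV for blank values in the columns that feed the dataset key. It is not how "fre catalog validate" works; the function name, the column list, and the sample rows (including the paths) are assumptions made for the example.

```python
import csv
import io

# Assumed list of columns that contribute to the dataset key.
# Adjust to the aggregation/groupby columns declared in your catalog spec.
KEY_COLUMNS = [
    "source_id", "experiment_id", "frequency", "member_id",
    "modeling_realm", "variable_id", "chunk_freq",
]


def find_empty_key_values(csv_file, key_columns=KEY_COLUMNS):
    """Return (row_number, column) pairs where a key column is missing or blank."""
    problems = []
    reader = csv.DictReader(csv_file)
    for row_number, row in enumerate(reader, start=2):  # row 1 is the header
        for column in key_columns:
            value = row.get(column)
            if value is None or value.strip() == "":
                problems.append((row_number, column))
    return problems


# Hypothetical two-row catalog: the first row has an empty member_id.
sample = io.StringIO(
    "source_id,experiment_id,frequency,member_id,modeling_realm,variable_id,chunk_freq,path\n"
    "am5,c96L65_am5f7b11r0_amip,P1M,,atmos,high_cld_amt,P1Y,/hypothetical/a.nc\n"
    "am5,c96L65_am5f7b11r0_amip,P1M,na,atmos_level,high_cld_amt,P1Y,/hypothetical/b.nc\n"
)
problems = find_empty_key_values(sample)
print(problems)
```

Running a check like this before launching the analysis script would surface the empty ensemble column as a catalog problem rather than as a KeyError inside the script.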

aradhakrishnanGFDL commented 6 months ago

cat = cat.search(variable_id="high_cld_amt")

dset_dict = cat.to_dataset_dict(cdf_kwargs={'chunks': {'time':5}, 'decode_times': False})

--> The keys in the returned dictionary of datasets are constructed as follows: 'source_id.experiment_id.frequency.modeling_realm.variable_id.chunk_freq'

dset_dict.keys()

dict_keys(['am5.c96L65_am5f7b11r0_amip.P1M.atmos_level.high_cld_amt.P1Y', 'am5.c96L65_am5f7b11r0_amip.P1M.atmos.high_cld_amt.P1Y'])
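
To make the mismatch concrete, here is a small sketch of how a dot-separated key follows from a list of column names. The helper build_key is hypothetical. The script-side column list is only inferred from the shape of the KeyError above, not taken from the script itself.

```python
def build_key(row, groupby_attrs, sep="."):
    """Join the values of the given columns, in order, into a dataset key."""
    return sep.join(str(row[attr]) for attr in groupby_attrs)


# Values taken from the catalog keys printed above.
row = {
    "source_id": "am5",
    "experiment_id": "c96L65_am5f7b11r0_amip",
    "frequency": "P1M",
    "modeling_realm": "atmos",
    "variable_id": "high_cld_amt",
    "chunk_freq": "P1Y",
}

# Key pattern reported by the catalog.
catalog_attrs = ["source_id", "experiment_id", "frequency",
                 "modeling_realm", "variable_id", "chunk_freq"]
catalog_key = build_key(row, catalog_attrs)

# Assumed script-side pattern, inferred from the KeyError: no source_id,
# no chunk_freq, an extra member_id slot, and "monthly" instead of "P1M".
script_row = dict(row, frequency="monthly", member_id="na")
script_attrs = ["experiment_id", "frequency", "member_id",
                "modeling_realm", "variable_id"]
script_key = build_key(script_row, script_attrs)

print(catalog_key)
print(script_key)
print(script_key == catalog_key)
```

Any difference in the column list, the column order, or a single value produces a different key, which is why the hard-coded lookup fails.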

aradhakrishnanGFDL commented 6 months ago

@ceblanton member_id is empty (""). When it is empty, perhaps the logic in Ray's script should drop it from the key name?

aradhakrishnanGFDL commented 6 months ago

Or we enforce no nulls, which may be something we discussed before.

aradhakrishnanGFDL commented 6 months ago

On May 9th, it was decided to use "na" as the default value for the aggregate columns rather than empty values, to help maintain a "key pattern" at this early stage of adoption. Down the line, we will provide examples that query for the dataset/key names dynamically.
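
A minimal sketch of that default, assuming catalog rows are handled as plain dictionaries. The function name and the column list are hypothetical and do not reflect the CatalogBuilder implementation.

```python
def fill_aggregate_defaults(rows, aggregate_columns, default="na"):
    """Return copies of rows with blank or missing aggregate values set to default."""
    filled = []
    for row in rows:
        new_row = dict(row)  # leave the caller's rows untouched
        for column in aggregate_columns:
            value = new_row.get(column)
            if value is None or str(value).strip() == "":
                new_row[column] = default
        filled.append(new_row)
    return filled


# Hypothetical rows: one with an empty member_id, one without the column at all.
rows = [
    {"experiment_id": "c96L65_am5f7b11r0_amip", "member_id": "", "variable_id": "high_cld_amt"},
    {"experiment_id": "c96L65_am5f7b11r0_amip", "variable_id": "high_cld_amt"},
]
filled = fill_aggregate_defaults(rows, ["member_id"])
print([r["member_id"] for r in filled])
```

With every aggregate column populated, each key has the same number of components, so a fixed key pattern stays usable until dynamic lookups are in place.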

aradhakrishnanGFDL commented 6 months ago

@ceblanton

The PR is ready for member_id to be "na" by default. But I realize Ray's key is still missing the chunk frequency, which is an aggregate column. I am not sure whether leaving it in the key or using a default for chunk_freq is a good idea. We can't possibly find unique datasets without it. But this also circles back to not having to hard-code these key names.

This now works:

am5.c96L65_am5f7b11r0_amip.P1M.na.atmos_level.high_cld_amt.P1Y

You can test:


import intake
import intake_esm

# Path to the catalog JSON
col = "/home/a1r/cat/canopy/am5f7b11r0/c96L65_am5f7b11r0_amipn0513.json"

cat_store = intake.open_esm_datastore(col)

cat_subset = cat_store.search(variable_id="high_cld_amt")

dset_dict = cat_subset.to_dataset_dict(cdf_kwargs={'chunks': {'time': 5}, 'decode_times': False})

# This gives the dataset names dynamically, based on the search and the existing catalog + spec.

for k in dset_dict.keys():
    print(k)

# Test for the new key that is expected to work

dset_dict['am5.c96L65_am5f7b11r0_amip.P1M.na.atmos_level.high_cld_amt.P1Y']
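
As one way to avoid hard-coding the full key, the keys returned by to_dataset_dict could be filtered by the components a script actually cares about. The sketch below is only an illustration: select_keys is a hypothetical helper, and the second key in the sample list is made up for the example.

```python
def select_keys(keys, required_parts, sep="."):
    """Return the keys whose sep-separated components include every required part."""
    matches = []
    for key in keys:
        parts = key.split(sep)
        if all(part in parts for part in required_parts):
            matches.append(key)
    return matches


# The first key is the one shown above; the second is illustrative.
keys = [
    "am5.c96L65_am5f7b11r0_amip.P1M.na.atmos_level.high_cld_amt.P1Y",
    "am5.c96L65_am5f7b11r0_amip.P1M.na.atmos.high_cld_amt.P1Y",
]
matches = select_keys(keys, ["atmos", "high_cld_amt"])
print(matches)
```

Components are compared exactly, so "atmos" does not match "atmos_level". This approach assumes no column value itself contains the separator.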

aradhakrishnanGFDL commented 6 months ago

Figure generated: /nbhome/a1r/analysis-scripts/pngs/cloud-fraction.png

Script used: https://github.com/aradhakrishnanGFDL/analysis-scripts/blob/prototype1-a1r/raytest.py

The changes are in my fork, and they cover only one suite:

https://github.com/aradhakrishnanGFDL/analysis-scripts/tree/prototype1-a1r/freanalysis_clouds

aradhakrishnanGFDL commented 6 months ago

To support this, we need to remove source_id from the aggregation columns. MDTF uses it, though, so let's discuss. @ceblanton