aradhakrishnanGFDL / CatalogBuilder

CatalogBuilder for data discovery and analysis

unified wrapper for cmip #131

Closed aradhakrishnanGFDL closed 2 weeks ago

aradhakrishnanGFDL commented 4 weeks ago

This PR addresses #129.

To test: first run against a GFDL PP directory to confirm nothing broke there. Then test with CMIP data using the following example, or adapt it to another case:

./gen_intake_gfdl.py --config config-cmip.yaml

config-cmip.yaml is now in configs/ and includes a test case for CMIP.

Expected CSV and JSON output:

JSON generated at: /home/a1r/github/CatalogBuilder/scripts/catalogcmip.json
CSV generated at: /home/a1r/github/CatalogBuilder/scripts/catalogcmip.csv

aradhakrishnanGFDL commented 3 weeks ago

Update: adding table_id, grid_label, and version_id to the JSON under aggregate_columns makes it work. Still exploring other ways to do this.
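For reference, a minimal sketch of that JSON change. It assumes that "aggregate_columns" corresponds to the groupby_attrs list under aggregation_control in the intake-esm catalog JSON; the key actually written by CatalogBuilder may differ. The helper name add_groupby_attrs and the example spec are hypothetical and only illustrate the idea.

```python
import json


def add_groupby_attrs(catalog_spec, new_attrs):
    """Return a copy of catalog_spec with new_attrs appended to groupby_attrs.

    Assumption: the grouping columns live under
    catalog_spec["aggregation_control"]["groupby_attrs"].
    """
    # Deep copy through a JSON round trip so the caller's dict is untouched.
    spec = json.loads(json.dumps(catalog_spec))
    control = spec.setdefault("aggregation_control", {})
    attrs = control.setdefault("groupby_attrs", [])
    for name in new_attrs:
        if name not in attrs:  # avoid duplicate entries
            attrs.append(name)
    return spec


# Hypothetical, heavily trimmed catalog spec used only for illustration.
example_spec = {
    "aggregation_control": {
        "variable_column_name": "variable_id",
        "groupby_attrs": ["source_id", "experiment_id", "member_id"],
    }
}

updated_spec = add_groupby_attrs(
    example_spec, ["table_id", "grid_label", "version_id"]
)
print(json.dumps(updated_spec, indent=2))
```

With a real catalog, the same function could be applied to the dict returned by json.load() before writing the file back out.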

source /net2/rlm/analysis-scripts/example/env/bin/activate

import intake, intake_esm
col = "/home/a1r/github/CatalogBuilder/scripts/catalogcmip-2.json"
cat = intake.open_esm_datastore(col)
cat2 = cat.search(variable_id="tos",table_id="Oday",grid_label="gn")
dset_dict = cat2.to_dataset_dict(cdf_kwargs={'chunks': {'time':5}, 'decode_times': False})
dset_dict.keys()

dict_keys(['GFDL-ESM4.abrupt-4xCO2.r1i1p1f1.Oday.v20180701.tos.gn', 'GFDL-ESM4.1pctCO2.r1i1p1f1.Oday.v20180701.tos.gn', 'GFDL-ESM4.historical.r1i1p1f1.Oday.v20190726.tos.gn', 'GFDL-ESM4.historical.r2i1p1f1.Oday.v20180701.tos.gn', 'GFDL-ESM4.esm-hist.r1i1p1f1.Oday.v20180701.tos.gn', 'GFDL-ESM4.historical.r3i1p1f1.Oday.v20180701.tos.gn', 'GFDL-ESM4.piControl.r1i1p1f1.Oday.v20180701.tos.gn'])

aradhakrishnanGFDL commented 3 weeks ago

If we remove version_id from the aggregate columns, it still works, but the user needs to search for a specific version_id before using the resulting xarray dataset. Otherwise there are no errors until, most likely, you make a plot and see overlapping time periods from the different versions.
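One way to guard against this is to inspect the catalog DataFrame before loading anything. The sketch below is not part of the PR: it uses a toy DataFrame in place of cat.df, the helper names are hypothetical, and it assumes that version_id strings of the form vYYYYMMDD sort chronologically when compared as text.

```python
import pandas as pd

# Columns assumed to identify one dataset once version_id is dropped
# from the grouping (an assumption, adjust to the catalog in use).
KEY_COLUMNS = [
    "source_id", "experiment_id", "member_id",
    "table_id", "variable_id", "grid_label",
]


def multi_version_groups(df, keys=KEY_COLUMNS):
    """Return the groups that contain more than one distinct version_id."""
    counts = df.groupby(keys)["version_id"].nunique()
    return counts[counts > 1]


def keep_latest_version(df, keys=KEY_COLUMNS):
    """Keep only the rows whose version_id is the largest in their group."""
    latest = df.groupby(keys)["version_id"].transform("max")
    return df[df["version_id"] == latest]


# Toy rows standing in for cat.df (values are illustrative only).
toy = pd.DataFrame({
    "source_id": ["GFDL-ESM4"] * 3,
    "experiment_id": ["historical", "historical", "piControl"],
    "member_id": ["r1i1p1f1"] * 3,
    "table_id": ["Omon"] * 3,
    "variable_id": ["dissocos"] * 3,
    "grid_label": ["gr"] * 3,
    "version_id": ["v20180701", "v20190726", "v20180701"],
})

print(multi_version_groups(toy))
print(keep_latest_version(toy))
```

If multi_version_groups returns any rows, those datasets would mix versions when version_id is not part of the grouping.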

col = "/home/a1r/github/CatalogBuilder/scripts/catalogcmip-3.json"
cat = intake.open_esm_datastore(col)  # reopen so the search below uses the new catalog

cat2 = cat.search(variable_id="dissocos",table_id="Omon",grid_label='gr')
dset_dict = cat2.to_dataset_dict(cdf_kwargs={'chunks': {'time':5}, 'decode_times': False})

--> The keys in the returned dictionary of datasets are constructed as follows:
    'source_id.experiment_id.member_id.table_id.grid_label'
100.00% [7/7 08:59<00:00]

dset_dict.keys()

dict_keys(['GFDL-ESM4.1pctCO2.r1i1p1f1.Oday.gr', 'GFDL-ESM4.abrupt-4xCO2.r1i1p1f1.Oday.gr', 'GFDL-ESM4.esm-hist.r1i1p1f1.Oday.gr', 'GFDL-ESM4.historical.r2i1p1f1.Oday.gr', 'GFDL-ESM4.historical.r1i1p1f1.Oday.gr', 'GFDL-ESM4.historical.r3i1p1f1.Oday.gr', 'GFDL-ESM4.piControl.r1i1p1f1.Oday.gr'])
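Since the dataset keys follow a dot-separated template, they can be split back into named fields. This is a small sketch with a hypothetical helper, split_dataset_key. It assumes that none of the field values contains a dot, which holds for the keys shown here but is not guaranteed in general.

```python
def split_dataset_key(
    key,
    template="source_id.experiment_id.member_id.table_id.grid_label",
):
    """Map each field name in the template to its value in the key."""
    names = template.split(".")
    values = key.split(".")
    if len(names) != len(values):
        raise ValueError(
            f"key has {len(values)} fields, template expects {len(names)}"
        )
    return dict(zip(names, values))


# One of the keys from the output above.
parsed = split_dataset_key("GFDL-ESM4.historical.r1i1p1f1.Oday.gr")
print(parsed)
```

For keys that include version_id and variable_id, pass a longer template with the fields in the same order as the key.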

Of course, Oday/tos has only one version. What if there are two versions?

# Note: .nunique is missing its call parentheses here, so the REPL prints the
# bound method (and with it the whole Series) rather than a count.
cat2.df[(cat2.df['variable_id']=='dissocos') & (cat2.df['experiment_id']=='historical')]['version_id'].nunique

<bound method IndexOpsMixin.nunique of
0     v20180701
1     v20180701
2     v20180701
3     v20180701
4     v20180701
5     v20180701
6     v20180701
7     v20180701
8     v20180701
9     v20190726
10    v20190726
11    v20190726
12    v20190726
13    v20190726
14    v20190726
15    v20190726
16    v20190726
17    v20190726

cat2.df.groupby("variable_id")[["experiment_id", "grid_label","version_id","variable_id", "table_id"]].nunique()

             experiment_id  grid_label  version_id  variable_id  table_id
variable_id                                                              
dissocos                 6           1           2            1         1

Instead, use this:

cat2.df.groupby("variable_id")[["source_id","experiment_id","frequency","member_id","grid_label","version_id","variable_id", "table_id"]].nunique()

             source_id  experiment_id  frequency  member_id  grid_label  version_id  variable_id  table_id
variable_id                                                                                               
dissocos    

aradhakrishnanGFDL commented 3 weeks ago

>>> cat2.df.groupby("variable_id")[["source_id","experiment_id","frequency","modeling_realm","member_id","table_id","grid_label","chunk_freq","version_id"]].nunique()
             source_id  experiment_id  frequency  modeling_realm  member_id  table_id  grid_label  chunk_freq  version_id
variable_id                                                                                                              
aragos               1              1          0               0          1         1           1           0           1
baccos               1              1          0               0          1         1           1           0           1
bfeos                1              1          0               0          1         1           1           0           1
bsios                1              1          0               0          1         1           1           0           1
calcos               1              1          0               0          1         1           1           0           1
...                ...            ...        ...             ...        ...       ...         ...         ...         ...
zmicro               1              1          0               0          1         1           1           0           1
zmicroos             1              1          0               0          1         1           1           0           1
zooc                 1              1          0               0          1         1           1           0           1
zoocos               1              1          0               0          1         1           1           0           1
zos                  1              1          0               0          1         1           1           0           1

[118 rows x 9 columns]
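
The zeros under frequency, modeling_realm, and chunk_freq mean those columns have no non-missing values, since nunique() skips missing entries by default. A quick way to list such columns is sketched below on a toy DataFrame standing in for cat2.df; the helper name empty_columns is hypothetical.

```python
import pandas as pd


def empty_columns(df):
    """Return the names of columns in which every value is missing."""
    return [name for name in df.columns if df[name].isna().all()]


# Toy rows standing in for cat2.df (values are illustrative only).
toy_catalog = pd.DataFrame({
    "variable_id": ["zos", "zooc"],
    "table_id": ["Omon", "Omon"],
    "frequency": [None, None],
    "chunk_freq": [float("nan"), float("nan")],
})

print(empty_columns(toy_catalog))
```

Columns reported this way are candidates for being filled in by the builder or dropped from the catalog.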

aradhakrishnanGFDL commented 3 weeks ago

Examples and tests are documented here: https://github.com/aradhakrishnanGFDL/canopy-cats/blob/main/notebooks/cmip_example.ipynb

Sample catalogs are available locally at GFDL, at the path described in the notebooks. There is also a copy of the catalogs at https://github.com/aradhakrishnanGFDL/canopy-cats/tree/main/catalogs/cmip-eg