Catalogues - Githubissues

wwieder commented 5 years ago

Examples of how to use catalogues to input data and operate on multiple models / ensemble members.

wwieder commented 5 years ago

@jhamman can you point us to Anderson's example?

wwieder commented 5 years ago

Looks like we should use intake-esm?

jhamman commented 5 years ago

Yes, intake-esm would be a great place to start. @andersy005, we may want to sit down with you and learn a bit of how intake-esm could be useful for CTSM (and similar) land model ensembles.

andersy005 commented 5 years ago

@jhamman, we've been putting together a design document for intake-esm:

Intake-esm Design document: https://hackmd.io/baYzRSxIQSSP_EIhzumWHA

We decided to emulate what STAC has been doing in https://github.com/radiantearth/stac-spec by:

1. Creating an ESM collection spec, ideally as a CSV file in conjunction with a YAML file. This would contain arbitrary metadata columns (e.g. source_id, member_id) plus perhaps some required ones (data_format, path). The paths can be file paths or web endpoints. The spec lives in a standalone repo (https://github.com/NCAR/esm-collection-spec) with a simple spec validator script. Generating catalogs is the responsibility of the data provider; it can be as simple as walking a directory tree, or something else.
2. Refactoring intake-esm around the new catalog spec. Now its job is to parse the catalog and provide an intake interface for loading the data. No special cases: if the catalog matches the spec, intake-esm can handle it. I am currently working on the refactoring in https://github.com/NCAR/intake-esm/pull/135, and my hope is that by the time this is merged, it will be easier to extend intake-esm to support data holdings for CTSM and other model ensembles.
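
As a rough illustration of the "walking a directory tree" idea mentioned above, here is a minimal sketch that writes one CSV row per netCDF file. It is not the intake-esm or esm-collection-spec tooling. The filename pattern `<case>.<model>.<stream>.<variable>.<date_range>.nc`, the column names other than data_format and path, the function name `build_catalog`, and the example case name are all assumptions made for illustration.

```python
import csv
import os
import tempfile


def build_catalog(root, csv_path):
    """Walk `root` and write one CSV row per netCDF file that matches the assumed pattern."""
    columns = ["case", "stream", "variable", "date_range", "data_format", "path"]
    rows = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(".nc"):
                continue
            # Assumed pattern: <case>.<model>.<stream>.<variable>.<date_range>.nc
            # The case name may itself contain dots, so split from the right.
            parts = name[: -len(".nc")].rsplit(".", 4)
            if len(parts) != 5:
                continue  # skip files that do not match the assumed pattern
            case, _model, stream, variable, date_range = parts
            rows.append(
                {
                    "case": case,
                    "stream": stream,
                    "variable": variable,
                    "date_range": date_range,
                    "data_format": "netcdf",
                    "path": os.path.join(dirpath, name),
                }
            )
    rows.sort(key=lambda row: row["path"])
    with open(csv_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=columns)
        writer.writeheader()
        writer.writerows(rows)
    return rows


# Demo on a throwaway directory with hypothetical file names.
with tempfile.TemporaryDirectory() as root:
    lnd_dir = os.path.join(root, "lnd")
    os.makedirs(lnd_dir)
    for name in (
        "i.e21.HYPOTHETICAL.001.clm2.h0.GPP.185001-201412.nc",
        "i.e21.HYPOTHETICAL.001.clm2.h0.TWS.185001-201412.nc",
        "README.txt",
    ):
        open(os.path.join(lnd_dir, name), "w").close()
    catalog_rows = build_catalog(root, os.path.join(root, "catalog.csv"))

for row in catalog_rows:
    print(row["case"], row["stream"], row["variable"], row["date_range"])
```

A real catalog would also need the accompanying description file required by the spec, and the parsing would have to follow whatever naming convention the data provider actually uses.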

> we may want to sit down with you and learn a bit of how intake-esm could be useful for CTSM

Feel free to ping me whenever you have time for us to discuss this in the coming weeks.

wwieder commented 5 years ago

Sounds like we need to do this for CESM2 output, if it hasn't been done already. We'd like to have this for glade/collections/cdg/*, which holds data in lots of different flavors and organizations.

jhamman commented 5 years ago

According to @andersy005, the CESM2 output should now be available in the catalog:

import intake

col = intake.open_esm_datastore("/glade/collections/cmip/catalog/intake-esm-datastore/catalogs/glade-cmip6.json")

col.search(source_id="CESM2").nunique()

activity_id           14
institution_id         1
source_id              1
experiment_id         57
member_id            102
table_id              32
variable_id          742
grid_label             2
dcpp_init_year         0
version               40
time_range           780
path              215511
dtype: int64
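
To make the search-then-count step above concrete, here is a small sketch of the same idea on a plain table, using only the standard library. It does not call intake-esm. The catalog excerpt, its paths, and the helper names `search` and `nunique` are hypothetical stand-ins chosen to mirror the calls shown above.

```python
import csv
import io

# Hypothetical catalog excerpt; real catalogs have many more columns and rows.
CATALOG_CSV = """source_id,experiment_id,member_id,variable_id,path
CESM2,historical,r1i1p1f1,gpp,/hypothetical/a.nc
CESM2,historical,r2i1p1f1,gpp,/hypothetical/b.nc
CESM2,ssp585,r1i1p1f1,gpp,/hypothetical/c.nc
CESM2-WACCM,historical,r1i1p1f1,gpp,/hypothetical/d.nc
"""


def search(rows, **criteria):
    """Keep only rows whose columns equal every given criterion."""
    return [r for r in rows if all(r[k] == v for k, v in criteria.items())]


def nunique(rows):
    """Count the distinct values in each column of the remaining rows."""
    columns = list(rows[0].keys()) if rows else []
    return {c: len({r[c] for r in rows}) for c in columns}


rows = list(csv.DictReader(io.StringIO(CATALOG_CSV)))
counts = nunique(search(rows, source_id="CESM2"))
print(counts)
```

Read this way, each number in the output above is the count of distinct values left in that column after filtering on source_id.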

andersy005 commented 5 years ago

This includes the CMORized CESM2 runs only. We don't yet have a catalog for CESM2 runs that weren't submitted to CMIP6. I am planning on working on this in the coming days.

wwieder commented 4 years ago

Are there any updates here? How can we build a catalogue of land-only simulations that are on disk?

andersy005 commented 4 years ago

@wwieder

> land-only simulations

Are these direct outputs from CESM i.e. history(time-slice) files with multiple variables in one file?

wwieder commented 4 years ago

No, they've been post-processed into single-variable time series, and sometimes CMORized. Many cases are held here: /glade/p/cgd/tss/people/oleson/CLM_LAND_ONLY_RELEASE

Other users would likely like to see simulations that are part of CMIP, and submitted for LS3MIP & LUMIP, but I'm not sure where these data are locally?

dlawrenncar commented 4 years ago

Expect that @andersy005 knows this, but the CMIP6 CESM land-only simulations are here in CESM output format: /glade/collections/cdg/timeseries-cmip6 (i.e21*) and here in CMORized format (I believe already pulled into the catalogue): /glade/collections/cdg/cmip6/

andersy005 commented 4 years ago

The collection/catalogue for the non-CMORized CMIP6 data resides here: /glade/collections/cmip/catalog/intake-esm-datastore/catalogs/campaign-cesm2-cmip6-timeseries.json. This collection is for the CESM2 raw output that went into CMIP6 data located in campaign storage, accessible via GLADE on casper at /glade/campaign/collections/cmip/CMIP6/timeseries-cmip6

In [1]: import intake

In [2]: col = intake.open_esm_datastore("/glade/collections/cmip/catalog/intake-esm-datastore/catalogs/campaign-cesm2-cmip6-timeseries.json")

In [3]: col.df.head()                                                                                                                        
Out[3]: 
      experiment                                         case component  ... ctrl_branch_year ctrl_experiment ctrl_member_id
0  esm-piControl  b.e21.B1850.f09_g17.CMIP6-esm-piControl.001       atm  ...              501       piControl              1
1  esm-piControl  b.e21.B1850.f09_g17.CMIP6-esm-piControl.001       atm  ...              501       piControl              1
2  esm-piControl  b.e21.B1850.f09_g17.CMIP6-esm-piControl.001       atm  ...              501       piControl              1
3  esm-piControl  b.e21.B1850.f09_g17.CMIP6-esm-piControl.001       atm  ...              501       piControl              1
4  esm-piControl  b.e21.B1850.f09_g17.CMIP6-esm-piControl.001       atm  ...              501       piControl              1

[5 rows x 11 columns]

In [4]: col                                                                                                                                  
Out[4]: 
campaign-cesm2-cmip6-timeseries-ESM Collection with 279742 entries:
        > 13 experiment(s)

        > 30 case(s)

        > 6 component(s)

        > 22 stream(s)

        > 2636 variable(s)

        > 512 date_range(s)

        > 12 member_id(s)

        > 279742 path(s)

        > 18 ctrl_branch_year(s)

        > 5 ctrl_experiment(s)

        > 4 ctrl_member_id(s)

As of today (January 30, 2020), this collection has 279,742 assets (netCDF files).

Note: You have to use Casper in order to access campaign storage. If this is an issue, we can put together a catalogue that points to the data residing in /glade/collections/cdg/cmip6/

Ccing @mnlevy1981 who created the collection/catalogue for the data residing on campaign storage.

andersy005 commented 4 years ago

> Other users would likely like to see simulations that are part of CMIP, and submitted for LS3MIP & LUMIP, but I'm not sure where these data are locally?

For simulations that are part of CMIP6 and submitted for LS3MIP & LUMIP, the collection/catalogue resides in /glade/collections/cmip/catalog/intake-esm-datastore/catalogs/glade-cmip6.json

In [5]: col = intake.open_esm_datastore("/glade/collections/cmip/catalog/intake-esm-datastore/catalogs/glade-cmip6.json")                    

In [6]: col.df.head()                                                                                                                        
Out[6]: 
  activity_id institution_id source_id  ...    version     time_range                                               path
0        PMIP           NCAR     CESM2  ...  v20200110            NaN  /glade/collections/cmip/CMIP6/PMIP/NCAR/CESM2/...
1        PMIP           NCAR     CESM2  ...  v20200110            NaN  /glade/collections/cmip/CMIP6/PMIP/NCAR/CESM2/...
2        PMIP           NCAR     CESM2  ...  v20200110            NaN  /glade/collections/cmip/CMIP6/PMIP/NCAR/CESM2/...
3        PMIP           NCAR     CESM2  ...  v20200110            NaN  /glade/collections/cmip/CMIP6/PMIP/NCAR/CESM2/...
4        PMIP           NCAR     CESM2  ...  v20200110  105101-110012  /glade/collections/cmip/CMIP6/PMIP/NCAR/CESM2/...

[5 rows x 12 columns]

In [7]: col                                                                                                                                  
Out[7]: 
glade-cmip6-ESM Collection with 1506961 entries:
        > 17 activity_id(s)

        > 28 institution_id(s)

        > 59 source_id(s)

        > 97 experiment_id(s)

        > 164 member_id(s)

        > 35 table_id(s)

        > 1028 variable_id(s)

        > 12 grid_label(s)

        > 59 dcpp_init_year(s)

        > 295 version(s)

        > 8476 time_range(s)

        > 1506961 path(s)

In [8]: col_subset = col.search(institution_id="NCAR")

In [9]: col_subset.unique(columns=["activity_id", "source_id"])                                                                              
Out[9]: 
{'activity_id': {'count': 16,
  'values': ['PMIP',
   'ScenarioMIP',
   'AerChemMIP',
   'RFMIP',
   'OMIP',
   'C4MIP',
   'GeoMIP',
   'DCPP',
   'CMIP',
   'CFMIP',
   'LUMIP',
   'GMMIP',
   'PAMIP',
   'DAMIP',
   'LS3MIP',
   'CDRMIP']},
 'source_id': {'count': 5,
  'values': ['CESM2',
   'CESM2-WACCM',
   'CESM1-1-CAM5-CMIP5',
   'CESM2-WACCM-FV2',
   'CESM2-FV2']}}
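
Since the original question was about operating on multiple ensemble members, here is a minimal sketch of one way to group catalog rows for the land-focused activities (LS3MIP and LUMIP) by experiment and list the members available in each group. It uses plain Python rather than intake-esm. The rows, their paths, and the function name `group_members` are hypothetical and only illustrate the grouping logic.

```python
from collections import defaultdict

# Hypothetical catalog rows; the paths and member lists are made up for illustration.
rows = [
    {"activity_id": "LS3MIP", "source_id": "CESM2", "experiment_id": "land-hist",
     "member_id": "r1i1p1f1", "path": "/hypothetical/ls3mip_r1.nc"},
    {"activity_id": "LUMIP", "source_id": "CESM2", "experiment_id": "land-noLu",
     "member_id": "r1i1p1f1", "path": "/hypothetical/lumip_r1.nc"},
    {"activity_id": "LUMIP", "source_id": "CESM2", "experiment_id": "land-noLu",
     "member_id": "r2i1p1f1", "path": "/hypothetical/lumip_r2.nc"},
    {"activity_id": "CMIP", "source_id": "CESM2", "experiment_id": "historical",
     "member_id": "r1i1p1f1", "path": "/hypothetical/cmip_r1.nc"},
]

LAND_MIPS = {"LS3MIP", "LUMIP"}


def group_members(rows, activities):
    """Map (activity, model, experiment) to the sorted member_ids found for it."""
    groups = defaultdict(set)
    for row in rows:
        if row["activity_id"] in activities:
            key = (row["activity_id"], row["source_id"], row["experiment_id"])
            groups[key].add(row["member_id"])
    return {key: sorted(members) for key, members in groups.items()}


members = group_members(rows, LAND_MIPS)
for key, member_ids in sorted(members.items()):
    print(key, member_ids)
```

Each group could then be loaded and combined along a member dimension, which is the kind of per-ensemble operation the issue asks about.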

NCAR / ctsm_python_gallery

Catalogues #9