NCAR / ctsm_python_gallery

A place to put sample workflows and tools that use ctsm model output
Apache License 2.0

Catalogues #9

Open wwieder opened 4 years ago

wwieder commented 4 years ago

Examples of how to use catalogues to input data and operate on multiple models / ensemble members.

wwieder commented 4 years ago

@jhamman can you point us to Anderson's example?

wwieder commented 4 years ago

Looks like we should use intake-esm?

jhamman commented 4 years ago

Yes, intake-esm would be a great place to start. @andersy005, we may want to sit down with you and learn a bit of how intake-esm could be useful for CTSM (and similar) land model ensembles.

andersy005 commented 4 years ago

@jhamman, we've been putting together a design document for intake-esm:

We decided to emulate what stac has been doing in https://github.com/radiantearth/stac-spec by

> we may want to sit down with you and learn a bit of how intake-esm could be useful for CTSM

Feel free to ping me whenever you have time for us to discuss this in the coming weeks.

wwieder commented 4 years ago

Sounds like we need to do this for CESM2 output, if it hasn't been done already. We'd like a catalogue built from glade/collections/cdg/*, which holds data in many different formats and directory layouts.

jhamman commented 4 years ago

According to @andersy005, the CESM2 output should now be available in the catalog:

import intake
col = intake.open_esm_datastore("/glade/collections/cmip/catalog/intake-esm-datastore/catalogs/glade-cmip6.json")
col.search(source_id="CESM2").nunique()

activity_id           14
institution_id         1
source_id              1
experiment_id         57
member_id            102
table_id              32
variable_id          742
grid_label             2
dcpp_init_year         0
version               40
time_range           780
path              215511
dtype: int64

andersy005 commented 4 years ago

This includes the cmorized CESM2 runs only. We don't have a catalog for CESM2 runs that didn't get submitted to CMIP6 yet. I am planning on working on this in the coming days.

wwieder commented 4 years ago

Are there any updates here? How can we build a catalogue of land-only simulations that are on disk?

andersy005 commented 4 years ago

@wwieder

> land-only simulations

Are these direct outputs from CESM i.e. history(time-slice) files with multiple variables in one file?

wwieder commented 4 years ago

No, they've been post-processed into single-variable time series, and sometimes CMORized. Many cases are held here: /glade/p/cgd/tss/people/oleson/CLM_LAND_ONLY_RELEASE

Other users would likely want to see simulations that are part of CMIP and were submitted for LS3MIP & LUMIP, but I'm not sure where these data live locally.
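
For single-variable time-series files like these, one way to build a catalogue is to walk the directory tree, parse each filename into columns, and write the CSV plus JSON descriptor pair that intake-esm reads. The following is only a minimal sketch under stated assumptions: the filename layout `<case>.clm2.<hN>.<variable>.<start>-<end>.nc`, the catalogue id `clm-land-only`, and the descriptor fields (which follow my reading of the ESM collection specification) are all assumptions and should be checked against the actual files and the spec before use.

```python
import csv
import json
import re
from pathlib import Path

# ASSUMED filename layout for post-processed CLM/CTSM time series:
#   <case>.clm2.<hN>.<variable>.<start>-<end>.nc
# Case names contain dots themselves, so the pattern anchors on ".clm2.hN.".
FILENAME_RE = re.compile(
    r"^(?P<case>.+)\.(?P<stream>clm2\.h\d+)\."
    r"(?P<variable>[^.]+)\.(?P<date_range>\d+-\d+)\.nc$"
)

COLUMNS = ["case", "component", "stream", "variable", "date_range", "path"]


def parse_timeseries_path(path):
    """Return one catalogue row (a dict), or None if the name does not match."""
    match = FILENAME_RE.match(Path(path).name)
    if match is None:
        return None
    row = match.groupdict()
    row["component"] = "lnd"  # assumption: every matched file is land output
    row["path"] = str(path)
    return row


def build_catalogue(root, csv_path, json_path):
    """Scan `root` for netCDF files and write a CSV table plus JSON descriptor."""
    candidates = sorted(Path(root).rglob("*.nc"))
    rows = [r for r in map(parse_timeseries_path, candidates) if r is not None]

    with open(csv_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=COLUMNS)
        writer.writeheader()
        writer.writerows(rows)

    # Descriptor fields are an assumption based on the ESM collection spec.
    descriptor = {
        "esmcat_version": "0.1.0",
        "id": "clm-land-only",  # hypothetical catalogue name
        "description": "Land-only single-variable time series (sketch)",
        "catalog_file": str(csv_path),
        "attributes": [{"column_name": c} for c in COLUMNS if c != "path"],
        "assets": {"column_name": "path", "format": "netcdf"},
    }
    with open(json_path, "w") as f:
        json.dump(descriptor, f, indent=2)
    return rows
```

Calling `build_catalogue` on the directory above with two output filenames of your choosing would write the pair of files; files whose names do not match the assumed pattern are skipped rather than guessed at, so the row count is worth comparing against the number of files on disk.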

dlawrenncar commented 4 years ago

I expect @andersy005 knows this, but the CMIP6 CESM land-only simulations are available in CESM output format at /glade/collections/cdg/timeseries-cmip6 (i.e21*) and in CMORized format (already pulled into the catalogue, I believe) at /glade/collections/cdg/cmip6/

andersy005 commented 4 years ago

The collection/catalogue for the non-CMORized CMIP6 data resides here: /glade/collections/cmip/catalog/intake-esm-datastore/catalogs/campaign-cesm2-cmip6-timeseries.json. This collection covers the raw CESM2 output that went into CMIP6. The data are located in campaign storage, accessible via GLADE on Casper at /glade/campaign/collections/cmip/CMIP6/timeseries-cmip6

In [1]: import intake

In [2]: col = intake.open_esm_datastore("/glade/collections/cmip/catalog/intake-esm-datastore/catalogs/campaign-cesm2-cmip6-timeseries.json")

In [3]: col.df.head()
Out[3]:
      experiment                                         case component  ... ctrl_branch_year ctrl_experiment ctrl_member_id
0  esm-piControl  b.e21.B1850.f09_g17.CMIP6-esm-piControl.001       atm  ...              501       piControl              1
1  esm-piControl  b.e21.B1850.f09_g17.CMIP6-esm-piControl.001       atm  ...              501       piControl              1
2  esm-piControl  b.e21.B1850.f09_g17.CMIP6-esm-piControl.001       atm  ...              501       piControl              1
3  esm-piControl  b.e21.B1850.f09_g17.CMIP6-esm-piControl.001       atm  ...              501       piControl              1
4  esm-piControl  b.e21.B1850.f09_g17.CMIP6-esm-piControl.001       atm  ...              501       piControl              1

[5 rows x 11 columns]

In [4]: col
Out[4]:
campaign-cesm2-cmip6-timeseries-ESM Collection with 279742 entries:
        > 13 experiment(s)
        > 30 case(s)
        > 6 component(s)
        > 22 stream(s)
        > 2636 variable(s)
        > 512 date_range(s)
        > 12 member_id(s)
        > 279742 path(s)
        > 18 ctrl_branch_year(s)
        > 5 ctrl_experiment(s)
        > 4 ctrl_member_id(s)

As of today (January 30, 2020), this collection has 279,742 assets (netCDF files).

Note: You have to use Casper in order to access campaign storage. If this is an issue, we can put together a catalogue that points to the data residing in /glade/collections/cdg/cmip6/

CCing @mnlevy1981, who created the collection/catalogue for the data residing on campaign storage.

andersy005 commented 4 years ago

> Other users would likely like to see simulations that are part of CMIP, and submitted for LS3MIP & LUMIP, but I'm not sure where these data are locally?

For simulations that are part of CMIP6 and submitted for LS3MIP & LUMIP, the collection/catalogue resides in /glade/collections/cmip/catalog/intake-esm-datastore/catalogs/glade-cmip6.json

In [5]: col = intake.open_esm_datastore("/glade/collections/cmip/catalog/intake-esm-datastore/catalogs/glade-cmip6.json")                    

In [6]: col.df.head()                                                                                                                        
Out[6]: 
  activity_id institution_id source_id  ...    version     time_range                                               path
0        PMIP           NCAR     CESM2  ...  v20200110            NaN  /glade/collections/cmip/CMIP6/PMIP/NCAR/CESM2/...
1        PMIP           NCAR     CESM2  ...  v20200110            NaN  /glade/collections/cmip/CMIP6/PMIP/NCAR/CESM2/...
2        PMIP           NCAR     CESM2  ...  v20200110            NaN  /glade/collections/cmip/CMIP6/PMIP/NCAR/CESM2/...
3        PMIP           NCAR     CESM2  ...  v20200110            NaN  /glade/collections/cmip/CMIP6/PMIP/NCAR/CESM2/...
4        PMIP           NCAR     CESM2  ...  v20200110  105101-110012  /glade/collections/cmip/CMIP6/PMIP/NCAR/CESM2/...

[5 rows x 12 columns]

In [7]: col                                                                                                                                  
Out[7]: 
glade-cmip6-ESM Collection with 1506961 entries:
        > 17 activity_id(s)
        > 28 institution_id(s)
        > 59 source_id(s)
        > 97 experiment_id(s)
        > 164 member_id(s)
        > 35 table_id(s)
        > 1028 variable_id(s)
        > 12 grid_label(s)
        > 59 dcpp_init_year(s)
        > 295 version(s)
        > 8476 time_range(s)
        > 1506961 path(s)

In [8]: col_subset = col.search(institution_id="NCAR")

In [9]: col_subset.unique(columns=["activity_id", "source_id"])                                                                              
Out[9]: 
{'activity_id': {'count': 16,
  'values': ['PMIP',
   'ScenarioMIP',
   'AerChemMIP',
   'RFMIP',
   'OMIP',
   'C4MIP',
   'GeoMIP',
   'DCPP',
   'CMIP',
   'CFMIP',
   'LUMIP',
   'GMMIP',
   'PAMIP',
   'DAMIP',
   'LS3MIP',
   'CDRMIP']},
 'source_id': {'count': 5,
  'values': ['CESM2',
   'CESM2-WACCM',
   'CESM1-1-CAM5-CMIP5',
   'CESM2-WACCM-FV2',
   'CESM2-FV2']}}
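
Since LS3MIP and LUMIP both appear in the activity list, the same `search` call can narrow the catalogue to just those land-focused MIPs by passing a list of values. The helper below is a small sketch, not part of intake-esm: `select_land_mips` is a hypothetical name, and the `table_id="Lmon"` / `variable_id="gpp"` values in the usage comment are illustrative choices that may or may not be present for every experiment.

```python
def select_land_mips(col, activities=("LS3MIP", "LUMIP"), institution="NCAR", **extra):
    """Subset an intake-esm catalogue to the given MIP activities.

    `col` is any object with an intake-esm style `search(**query)` method.
    Extra keyword arguments (e.g. variable_id="gpp") are added to the query.
    """
    query = {"institution_id": institution, "activity_id": list(activities)}
    query.update(extra)
    return col.search(**query)


# Usage on GLADE (not executed here; needs intake-esm and the catalogue file):
#   import intake
#   col = intake.open_esm_datastore(
#       "/glade/collections/cmip/catalog/intake-esm-datastore/catalogs/glade-cmip6.json"
#   )
#   subset = select_land_mips(col, table_id="Lmon", variable_id="gpp")
#   dsets = subset.to_dataset_dict()  # dict of xarray Datasets
```

Keeping the query in one function makes it easy to reuse the same selection across models or ensemble members, and checking `subset.df` before calling `to_dataset_dict()` shows which members and experiments actually matched.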