Open wwieder opened 5 years ago
@jhamman can you point us to Anderson's example?
Looks like we should use intake-esm?
Yes, intake-esm would be a great place to start. @andersy005, we may want to sit down with you and learn a bit of how intake-esm could be useful for CTSM (and similar) land model ensembles.
@jhamman, we've been putting together a design document for intake-esm:
We decided to emulate what STAC has been doing in https://github.com/radiantearth/stac-spec by:
Creating an ESM collection spec, ideally as a csv file in conjunction with a YAML file. This would contain arbitrary metadata columns (e.g. source_id, member_id) plus perhaps some required ones (data_format, path). The paths can be file paths or web endpoints. This lives in a standalone repo (https://github.com/NCAR/esm-collection-spec) with a simple spec validator script. Generating catalogs is the responsibility of the data provider. It can be as simple as walking a directory tree, or something else.
Refactoring intake-esm around the new catalog spec. Now its job is to parse the catalog and provide an intake interface to loading the data. No special cases. If the catalog matches the spec, intake-esm can handle it. I am currently working on the refactoring in https://github.com/NCAR/intake-esm/pull/135, and my hope is that by the time this is merged, it will be easier to extend intake-esm to support data holdings for CTSM and other model ensembles.
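To make the "walking a directory tree" idea concrete, here is a minimal sketch using only the Python standard library. It is not the actual esm-collection-spec tooling: the directory layout (`<root>/<case>/<component>/<variable>.<date_range>.nc`), the column names other than `path` and `data_format`, and the helper names `build_catalog_rows`, `validate_rows` and `write_catalog` are all assumptions made for illustration.

```python
import csv
import os

# Hypothetical column set. Only `path` and `data_format` are mentioned above as
# possibly required; the other columns are illustrative metadata.
COLUMNS = ["case", "component", "variable", "date_range", "data_format", "path"]
REQUIRED = ("data_format", "path")


def build_catalog_rows(root):
    """Walk `root` and return one catalog row per netCDF file.

    Assumes the hypothetical layout
    <root>/<case>/<component>/<variable>.<date_range>.nc
    and silently skips anything that does not match it.
    """
    rows = []
    for dirpath, _dirnames, filenames in os.walk(root):
        rel_parts = os.path.relpath(dirpath, root).split(os.sep)
        for name in filenames:
            if not name.endswith(".nc"):
                continue
            stem_parts = name[: -len(".nc")].split(".")
            if len(rel_parts) != 2 or len(stem_parts) != 2:
                continue  # file does not follow the assumed layout
            case, component = rel_parts
            variable, date_range = stem_parts
            rows.append(
                {
                    "case": case,
                    "component": component,
                    "variable": variable,
                    "date_range": date_range,
                    "data_format": "netcdf",
                    "path": os.path.join(dirpath, name),
                }
            )
    # Sort so the catalog is reproducible regardless of filesystem order.
    return sorted(rows, key=lambda row: row["path"])


def validate_rows(rows, required=REQUIRED):
    """Return a list of error messages for rows missing a required column."""
    errors = []
    for index, row in enumerate(rows):
        for column in required:
            if not row.get(column):
                errors.append(f"row {index}: missing required column '{column}'")
    return errors


def write_catalog(rows, csv_path):
    """Write the rows to a csv file with a fixed header."""
    with open(csv_path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=COLUMNS)
        writer.writeheader()
        writer.writerows(rows)
```

Real CESM time-series file names pack the case, stream, variable and date range into a single name, so the parsing step would need to be adapted to whatever convention the data provider actually uses.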
we may want to sit down with you and learn a bit of how intake-esm could be useful for CTSM
Feel free to ping me whenever you have time for us to discuss this in the coming weeks.
Sounds like we need to do this for CESM2 output, if it hasn't been done already? We'd like to have this from glade/collections/cdg/* that has data in lots of different flavors and organization?
According to @andersy005, the CESM2 output should now be available in the catalog:
import intake
col = intake.open_esm_datastore("/glade/collections/cmip/catalog/intake-esm-datastore/catalogs/glade-cmip6.json")
col.search(source_id="CESM2").nunique()
activity_id 14
institution_id 1
source_id 1
experiment_id 57
member_id 102
table_id 32
variable_id 742
grid_label 2
dcpp_init_year 0
version 40
time_range 780
path 215511
dtype: int64
This includes the CMORized CESM2 runs only. We don't yet have a catalog for CESM2 runs that weren't submitted to CMIP6. I am planning on working on this in the coming days.
Are there any updates here? How can we build a catalogue of land-only simulations that are on disk?
@wwieder
land-only simulations
Are these direct outputs from CESM, i.e. history (time-slice) files with multiple variables in one file?
No, they've been post-processed into single-variable time series, and sometimes CMORized. Many cases are being held here: /glade/p/cgd/tss/people/oleson/CLM_LAND_ONLY_RELEASE
Other users would likely like to see simulations that are part of CMIP and were submitted for LS3MIP & LUMIP, but I'm not sure where these data are stored locally?
I expect that @andersy005 knows this, but the CMIP6 CESM land-only simulations are here in CESM output format: /glade/collections/cdg/timeseries-cmip6 (i.e21*), and here in CMORized format (I believe already pulled into the catalogue): /glade/collections/cdg/cmip6/
The collection/catalogue for the non-CMORized CMIP6 data resides here: /glade/collections/cmip/catalog/intake-esm-datastore/catalogs/campaign-cesm2-cmip6-timeseries.json
This collection is for the CESM2 raw output that went into CMIP6. The data are located in campaign storage, accessible via GLADE on Casper at /glade/campaign/collections/cmip/CMIP6/timeseries-cmip6
In [1]: import intake
In [2]: col = intake.open_esm_datastore("/glade/collections/cmip/catalog/intake-esm-datastore/catalogs/campaign-cesm2-cmip6-timeseries.json")
In [3]: col.df.head()
Out[3]:
experiment case component ... ctrl_branch_year ctrl_experiment ctrl_member_id
0 esm-piControl b.e21.B1850.f09_g17.CMIP6-esm-piControl.001 atm ... 501 piControl 1
1 esm-piControl b.e21.B1850.f09_g17.CMIP6-esm-piControl.001 atm ... 501 piControl 1
2 esm-piControl b.e21.B1850.f09_g17.CMIP6-esm-piControl.001 atm ... 501 piControl 1
3 esm-piControl b.e21.B1850.f09_g17.CMIP6-esm-piControl.001 atm ... 501 piControl 1
4 esm-piControl b.e21.B1850.f09_g17.CMIP6-esm-piControl.001 atm ... 501 piControl 1
[5 rows x 11 columns]
In [4]: col
Out[4]:
campaign-cesm2-cmip6-timeseries-ESM Collection with 279742 entries:
> 13 experiment(s)
> 30 case(s)
> 6 component(s)
> 22 stream(s)
> 2636 variable(s)
> 512 date_range(s)
> 12 member_id(s)
> 279742 path(s)
> 18 ctrl_branch_year(s)
> 5 ctrl_experiment(s)
> 4 ctrl_member_id(s)
As of today (January 30, 2020), this collection has 279,742 assets (netCDF files).
Note: You have to use Casper in order to access campaign storage. If this is an issue, we can put together a catalogue that points to the data residing in /glade/collections/cdg/cmip6/
Ccing @mnlevy1981 who created the collection/catalogue for the data residing on campaign storage.
Other users would likely like to see simulations that are part of CMIP and were submitted for LS3MIP & LUMIP, but I'm not sure where these data are stored locally?
For simulations that are part of CMIP6 and submitted for LS3MIP & LUMIP, the collection/catalogue resides in /glade/collections/cmip/catalog/intake-esm-datastore/catalogs/glade-cmip6.json
In [5]: col = intake.open_esm_datastore("/glade/collections/cmip/catalog/intake-esm-datastore/catalogs/glade-cmip6.json")
In [6]: col.df.head()
Out[6]:
activity_id institution_id source_id ... version time_range path
0 PMIP NCAR CESM2 ... v20200110 NaN /glade/collections/cmip/CMIP6/PMIP/NCAR/CESM2/...
1 PMIP NCAR CESM2 ... v20200110 NaN /glade/collections/cmip/CMIP6/PMIP/NCAR/CESM2/...
2 PMIP NCAR CESM2 ... v20200110 NaN /glade/collections/cmip/CMIP6/PMIP/NCAR/CESM2/...
3 PMIP NCAR CESM2 ... v20200110 NaN /glade/collections/cmip/CMIP6/PMIP/NCAR/CESM2/...
4 PMIP NCAR CESM2 ... v20200110 105101-110012 /glade/collections/cmip/CMIP6/PMIP/NCAR/CESM2/...
[5 rows x 12 columns]
In [7]: col
Out[7]:
glade-cmip6-ESM Collection with 1506961 entries:
> 17 activity_id(s)
> 28 institution_id(s)
> 59 source_id(s)
> 97 experiment_id(s)
> 164 member_id(s)
> 35 table_id(s)
> 1028 variable_id(s)
> 12 grid_label(s)
> 59 dcpp_init_year(s)
> 295 version(s)
> 8476 time_range(s)
> 1506961 path(s)
In [8]: col_subset = col.search(institution_id="NCAR")
In [9]: col_subset.unique(columns=["activity_id", "source_id"])
Out[9]:
{'activity_id': {'count': 16,
'values': ['PMIP',
'ScenarioMIP',
'AerChemMIP',
'RFMIP',
'OMIP',
'C4MIP',
'GeoMIP',
'DCPP',
'CMIP',
'CFMIP',
'LUMIP',
'GMMIP',
'PAMIP',
'DAMIP',
'LS3MIP',
'CDRMIP']},
'source_id': {'count': 5,
'values': ['CESM2',
'CESM2-WACCM',
'CESM1-1-CAM5-CMIP5',
'CESM2-WACCM-FV2',
'CESM2-FV2']}}
Examples of how to use catalogues to input data and operate on multiple models / ensemble members.
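As a rough illustration of what such an example might involve, the sketch below filters catalog rows and groups file paths by model and ensemble member, so that each group could then be opened and processed together. It uses only the Python standard library as a stand-in and is not intake-esm itself: `search` here is a hypothetical helper that only loosely mirrors the `col.search(...)` calls shown earlier, `group_paths` is likewise hypothetical, and every row in the example catalog is made up.

```python
import csv
import io
from collections import defaultdict


def search(rows, **query):
    """Keep rows whose columns equal the requested values.

    A query value may be a single string or a list of allowed strings.
    """

    def matches(row):
        for column, wanted in query.items():
            allowed = wanted if isinstance(wanted, (list, tuple, set)) else [wanted]
            if row.get(column) not in allowed:
                return False
        return True

    return [row for row in rows if matches(row)]


def group_paths(rows, keys=("source_id", "member_id")):
    """Group file paths by the given metadata columns."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[key] for key in keys)].append(row["path"])
    return dict(groups)


# Made-up catalog rows for illustration only; these are not real GLADE paths.
EXAMPLE_CSV = """activity_id,source_id,member_id,variable_id,path
LS3MIP,CESM2,r1i1p1f1,gpp,/example/CESM2/r1/gpp.nc
LS3MIP,CESM2,r2i1p1f1,gpp,/example/CESM2/r2/gpp.nc
LUMIP,CESM2,r1i1p1f1,gpp,/example/CESM2/lumip/gpp.nc
LS3MIP,CESM2-WACCM,r1i1p1f1,tas,/example/WACCM/r1/tas.nc
"""

rows = list(csv.DictReader(io.StringIO(EXAMPLE_CSV)))
subset = search(rows, activity_id="LS3MIP", variable_id="gpp")
members = group_paths(subset)

# One entry per (model, ensemble member); each list of paths could be opened
# as a single dataset and then combined along an ensemble dimension.
for key, paths in sorted(members.items()):
    print(key, paths)
```

With intake-esm itself, the filtering would presumably be done with `col.search(...)` as in the snippets above, and the loading step with the collection's `to_dataset_dict()` method, rather than with helpers like these.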