Standard column naming convention for catalogs pointing to raw CESM output on glade / campaign

mnlevy1981 commented 4 years ago

We currently have three catalogs for different CESM output accessible from cheyenne / dav (excluding the CMOR-ized CMIP output): campaign-cesm2-cmip6-timeseries, glade-cesm1-cmip5-timeseries, and glade-cesm1-le (which should actually point to data on campaign storage and be renamed campaign-cesm1-le). I think all three of these should follow the same naming convention for columns in the CSV file, and should include:

experiment
case
file_fullpath
file_basename
date_range
sequence_order
member_id
component
grid
stream
variable
year_offset
parent_experiment
parent_member_id
branch_year_in_parent
branch_year_in_child
pertlim

With the following notes:

If branch_year_in_parent == branch_year_in_child, can we define the catalog with a YAML file that simply specifies branch_year and sets both columns to that one value?
If pertlim is not specified in the YAML file, it should be set to zero.
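
To make the two notes above concrete, here is a minimal sketch of how a catalog builder might apply them. It assumes each YAML entry has already been parsed into a plain dict; the branch_year shorthand key and the function names apply_defaults and missing_columns are hypothetical and do not exist in the current tooling. The column list is copied from this issue.

```python
import csv
import io

# Column names proposed in this issue, in the order listed above.
STANDARD_COLUMNS = [
    "experiment", "case", "file_fullpath", "file_basename", "date_range",
    "sequence_order", "member_id", "component", "grid", "stream", "variable",
    "year_offset", "parent_experiment", "parent_member_id",
    "branch_year_in_parent", "branch_year_in_child", "pertlim",
]


def apply_defaults(entry):
    """Return a copy of a parsed YAML entry with both notes applied.

    `entry` is assumed to be a plain dict. `branch_year` is a hypothetical
    shorthand key, not an option the existing catalog builder understands.
    """
    out = dict(entry)
    if "branch_year" in out:
        year = out.pop("branch_year")
        # Explicit per-column values, if present, win over the shorthand.
        out.setdefault("branch_year_in_parent", year)
        out.setdefault("branch_year_in_child", year)
    # pertlim defaults to zero when the YAML file does not specify it.
    out.setdefault("pertlim", 0)
    return out


def missing_columns(csv_text):
    """List the standard columns absent from a catalog CSV's header row."""
    header = next(csv.reader(io.StringIO(csv_text)), [])
    return [col for col in STANDARD_COLUMNS if col not in header]
```

Running missing_columns on each of the three catalogs would give a quick list of what still needs to be added to bring them in line with the convention.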

I've toyed with the idea of including a machine column as well, mainly as a way to note the differences between ensemble members 101 - 105 and 001 - 005 in the CESM1 Large Ensemble, but I think that might be too burdensome when creating future catalogs. I'm open to other people's thoughts on that, though.

Note that this issue supersedes #48 and #53, and a solution will do the same to PR #49, so I will close them in favor of tracking the conversation in a single place (namely this ticket).

mnlevy1981 commented 4 years ago

@jeffdlb recommends adding long_name and dim (2D or 3D) and splitting date_range into start_year and end_year for easier searching.
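
As a rough illustration of the date_range split, the sketch below assumes date_range strings look like YYYYMM-YYYYMM or YYYYMMDD-YYYYMMDD with zero-padded four-digit years; that format is an assumption, not something confirmed in this thread, and split_date_range is a hypothetical helper name.

```python
def split_date_range(date_range):
    """Split a date_range string into integer (start_year, end_year).

    Assumes the form "<start>-<end>" where each side begins with a
    zero-padded four-digit year, e.g. "185001-201412".
    """
    start, end = date_range.split("-")
    return int(start[:4]), int(end[:4])
```

Storing the years as integers would let searches use numeric comparisons (for example, start_year <= 1990 <= end_year) instead of string matching on date_range.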

sherimickelson commented 4 years ago

@jeffdlb, @mnlevy1981, @andersy005, @kmpaul, and @bonnland: I just wanted to add some references in case we want to create some kind of data structure that holds the long_name descriptions for CESM variables, as @jeffdlb suggested. These lists contain variable names, long names, and units. I'm not sure where we can find info on the dimensionality of each variable.
cam variables (atm), pop variables (ocn), clm variables (lnd), cice variables (sea ice)

mnlevy1981 commented 4 years ago

@sherimickelson That's a good idea! I'll add MARBL variables (ocn biogeochemistry) to the list. In this link, vertical_grid : none => 2D, and all other values (layer_avg may be the only one right now) => 3D.
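
The vertical_grid rule for MARBL variables could be turned into a dim value with something like the sketch below. The function name dim_from_vertical_grid is hypothetical, and treating a Python None the same as the string "none" is an extra assumption in case a YAML loader returns a null.

```python
def dim_from_vertical_grid(vertical_grid):
    """Map a MARBL vertical_grid setting to the proposed dim column.

    "none" means the variable is 2D; any other value (layer_avg may be
    the only one right now) means 3D.
    """
    # str(None) is "None", so a null value is also treated as 2D.
    is_none = str(vertical_grid).strip().lower() == "none"
    return "2D" if is_none else "3D"
```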

mnlevy1981 commented 4 years ago

I'll go ahead and update the three glade / campaign CESM catalogs (CMIP5 raw output, CMIP6 raw output, and LENS) with everything except long_name (the current tool can't read the netCDF files to fill that column).

NCAR / intake-esm-datastore

Standard column naming convention for catalogs pointing to raw CESM output on glade / campaign #64