NCAR / science-at-scale

Planning and Collaborative Space for Science @ Scale Project
Creative Commons Zero v1.0 Universal

Standard column names for catalogs #4

Open sethmcg opened 3 years ago

sethmcg commented 3 years ago

We had a discussion in the monthly S@S tag-up meeting and decided on some standardization of column names in the intake-esm catalogs.

| CESM | NA-CORDEX | STANDARD |
| --- | --- | --- |
| component | N/A | component |
| frequency | frequency | frequency |
| experiment | scenario | * |
| variable | variable | variable |
| variable_long_name | longname | long_name |
| dim_per_tstep | [add] | vertical_levels |
| start | [add] | start_time |
| end | [add] | end_time |
| [add gcm = 'CESM'?] | driver | * |
| N/A | rcm | rcm |
| N/A | grid | grid |
| [add = 'raw'?] | biascorrection | bias_correction |
| [add] | units | units |
| [add] | [add] | standard_name |
| path | path | path |

* NA-CORDEX uses 'driver' and 'scenario' instead of 'gcm' and 'experiment' because there are some simulations whose boundary conditions come from ERA-Interim, which is (technically speaking) a reanalysis, not a GCM. So 'scenario' is a superset of 'experiment' that includes 'era-int' in addition to 'historical', 'rcp85', etc. For LENS, it would probably make more sense to call it 'gcm'.

"vertical_levels" is an integer indicating the number of vertical levels; for a 2-D variables, it's '1'.

The NA-CORDEX 'grid' variable covers both spatial resolution and spatial domain. We'll probably also want to add some information about the spatial domain to the catalog metadata, but that may be a top-level element rather than a column. We'll probably want both a lat-lon bounding box and a human-readable "region" string; the spatial extent of the array is constant, but where there's non-missing data can vary. (E.g., data bias-corrected with Daymet covers North America (land-only), while data bias-corrected with gridMET covers only CONUS.) A region string might also apply to LENS, since atm data is global but ocean and ice data is not.

Note: column ordering doesn't generally matter, but 'path' should come at the end for legibility when the tables are printed out.
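As a concrete illustration, the renames in the table above could be applied to existing catalog DataFrames with pandas. This is only a sketch; the `[add]` columns are omitted because there is nothing to rename, and the one-row catalog contents are invented:

```python
import pandas as pd

# Rename maps taken from the standardization table; [add] columns are
# omitted because they have no existing name to rename from.
CESM_TO_STANDARD = {
    "variable_long_name": "long_name",
    "dim_per_tstep": "vertical_levels",
    "start": "start_time",
    "end": "end_time",
}
NA_CORDEX_TO_STANDARD = {
    "longname": "long_name",
    "biascorrection": "bias_correction",
}

def standardize(df, rename_map):
    """Return a copy of a catalog DataFrame with standardized column names."""
    return df.rename(columns=rename_map)

# Minimal illustration with a one-row NA-CORDEX-style catalog:
cat = pd.DataFrame({"longname": ["Minimum Temperature"], "biascorrection": ["raw"]})
cat = standardize(cat, NA_CORDEX_TO_STANDARD)
```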

jeffdlb commented 3 years ago

Thanks for sending this out, Seth.

Some thoughts below. Not all of these need to be implemented immediately -- or at all if they are bad ideas -- but I just want to put them out to provoke discussion.

(1) long_name is specific to the modeling project. In future it would be good to also have a 'cf_standard_name' column (recognizing that several variables may actually fit under the rubric of a given CF Name).

(2) Should 'rcm' be spelled out as regional_climate_model for greater understandability and consistency with the others? [ignore if #3 adopted]

(3) Perhaps we should think a bit more about how we handle the source-related columns like experiment+gcm for LENS and scenario+rcm+bias_correction for NA-CORDEX.

(4) We might wish to include a 'metadata' column that has the URL of the full metadata record. Intake-ESM allows for basic discovery, but somewhere we should point to the full metadata.

(5) We might wish to define a 'comment' column that has free text as needed.

Regards, Jeff DLB

Jeff de La Beaujardiere, PhD Director, NCAR/CISL Information Systems Division https://staff.ucar.edu/users/jeffdlb https://orcid.org/0000-0002-1001-9210


aaronspring commented 3 years ago

For predictions (numerical weather prediction, seasonal predictions (SubX, S2S), or decadal predictions (DCPP)) an additional dimension is needed; I would name it init (to be defined in the JSON and as a CSV column).

For that purpose, Anderson split dcpp_init_year out of member_id for experiment_id == dcppA-hindcast in CMIP6, which contains DCPP. This allows easy integration with climpred; see https://climpred.readthedocs.io/en/stable/examples/preprocessing/setup_your_own_data.html#intake-esm-for-cmorized-output. Catalog building happens in https://github.com/NCAR/intake-esm-datastore/blob/master/builders/notebooks/glade-cmip6_catalog_builder.ipynb. In DCPP, members are named s{inityear}-{ordinary-member_id_like_r2i1p1f1}, so the two pieces can be separated like this:

df["dcpp_init_year"] = df.member_id.map(lambda x: float(x.split("-")[0][1:] if x.startswith("s") else np.nan))
df["member_id"] = df["member_id"].map(lambda x: x.split("-")[-1] if x.startswith("s") else x)

However, dcpp_init_year is very specific to DCPP. init, meaning the timestamp of initialization, would fit better. Although the member naming also contains an i (e.g., r2i1p1f1), I would rather reserve that i for a particular type of initialization.

PS: Hi all, @jeffdlb referred me here. I am helping @judithberner get SubX into the cloud (https://github.com/pangeo-data/pangeo-datastore/issues/121) and we want to use intake-esm with it.

sethmcg commented 3 years ago

@jeffdlb - I really like the idea of having a "source" column that sums up the dataset-specific elements. I think there's too much variation between data sources to have much hope of being able to come up with a single unifying schema that will cover all the different experiments, but with the concatenated-source approach we can use whatever elements are appropriate for the dataset and still have one column that's appropriate for comparing across datasets.

That approach also allows augmenting the source column further if needed. If at some point we determine that we need to indicate that the dataset is part of Amazon's Open Data, we can just tack on "open-aws" on the end. (In fact, I even wonder if maybe the ideal way to do it would be to just list the columns to be aggregated to generate the source column and construct it on the fly when it's needed.)

Agreed that we should probably discuss it further. How should it be concatenated? My inclination would be to use whitespace to separate elements, but are there any specific tools / formats we need to consider compatibility with?
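The construct-on-the-fly idea could be sketched in pandas as follows. The column names follow the NA-CORDEX scheme; the row values and the whitespace separator are assumptions pending the discussion above:

```python
import pandas as pd

# Hypothetical catalog fragment; column names follow the NA-CORDEX scheme,
# but these particular row values are invented for illustration.
df = pd.DataFrame({
    "bias_correction": ["raw", "mbcn-Daymet"],
    "rcm": ["WRF", "RegCM4"],
    "driver": ["MPI-ESM-LR", "GFDL-ESM2M"],
    "scenario": ["rcp85", "hist"],
})

# The dataset-specific columns to aggregate; each catalog would list its own.
SOURCE_COLUMNS = ["bias_correction", "rcm", "driver", "scenario"]

def make_source(frame, columns=SOURCE_COLUMNS, sep=" "):
    """Concatenate dataset-specific columns into a single 'source' string."""
    return frame[columns].astype(str).agg(sep.join, axis=1)

df["source"] = make_source(df)
```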

bonnland commented 3 years ago

Agreed that we should probably discuss it further. How should it be concatenated? My inclination would be to use whitespace to separate elements, but are there any specific tools / formats we need to consider compatibility with?

Looking at the common python tools for reading/writing CSV files, the default delimiter for fields is the comma. It seems OK for now to assume that we will use this delimiter for fields, so the only other question is what sub-delimiter we want. If we can convince ourselves that a space is outside the alphabet for sub-fields (I can see why it could be), then it seems like a possible choice. Though perhaps a different choice will make the sub-fields more easily readable by people?

sethmcg commented 3 years ago

Brian and I discussed how to structure the zarr catalog for NA-CORDEX and decided it would be best to catalog data granules at the logical level (one record per distinct data element), rather than the physical level (one record per zarr store). So it will have one row for each ensemble member in each zarr store. For concatenated stores, the scenario will be e.g. "hist+rcp85".

I have updated the netcdf catalog for NA-CORDEX. The columns are now:

| column | description |
| --- | --- |
| variable | short CORDEX name (tmin, prec, rsds, etc.) |
| scenario | eval (ERA-Int), hist, rcp26, rcp45, or rcp85 (note rcp85, not rcp8.5) |
| driver | name of global climate model providing boundary conditions (or ERA-Int for reanalysis-driven runs) |
| rcm | name of regional climate model |
| frequency | "day" for everything except static variables, which = "fixed" |
| grid | "NAM-44i" = 0.5 degree resolution, "NAM-22i" = 0.25 degree; common lat-lon grid w/ cell boundaries at integer values |
| bias_correction | "raw" for uncorrected data, "mbcn-Daymet" or "mbcn-gridMET" for bias-corrected |
| long_name | human-readable description of data |
| units | value of units attribute for data; supposed to be standard by variable, but may not be |
| standard_name | CF standard name for data variable |
| vertical_levels | always = 1, since this is all 2-D data, no 3-D |
| member_id | "RCM.GCM"; needed for json aggregation |
| source | concatenation of bias_correction, rcm, driver, & scenario; for comparison with other catalogs (e.g., LENS) |
| path | absolute path to netcdf files on Glade |

Note that standard_name alone is insufficient to distinguish some variables. The variables tas and temp are the same, but in units of K and degC, respectively. Units add information, but not enough; tas (daily average temperature), tasmin (daily minimum temperature) and tasmax (daily maximum temperature) all have the same units and standard_name. For NA-CORDEX, the long_name attribute has been standardized, and provides a human-readable version of the information encompassed by variable, units, and standard_name.
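The ambiguity can be made concrete with a few invented catalog rows (the long_name strings here are illustrative, not quoted from the actual catalog):

```python
import pandas as pd

# Invented rows illustrating why standard_name alone is ambiguous.
cat = pd.DataFrame({
    "variable": ["tas", "tasmin", "tasmax"],
    "standard_name": ["air_temperature"] * 3,
    "units": ["K"] * 3,
    "long_name": [
        "Daily Average Near-Surface Temperature",
        "Daily Minimum Near-Surface Temperature",
        "Daily Maximum Near-Surface Temperature",
    ],
})

# All three rows match on standard_name and units...
ambiguous = cat[(cat.standard_name == "air_temperature") & (cat.units == "K")]
# ...but long_name (or variable) separates them.
```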

I have not yet added start_time or end_time; we need to discuss whether to use nominal bounds or actual bounds (which would require a lot more work).

bonnland commented 3 years ago

I have not yet added start_time or end_time; we need to discuss whether to use nominal bounds or actual bounds (which would require a lot more work).

By nominal bounds, do you mean the start and end time steps reflected in, for example, the NetCDF files for NA-CORDEX? Because this should work fine when it comes to creating Zarr stores. The concatenate/merge step for Zarr creation simply requires aligned calendar axes, and the resulting time axis will be the union of all aligned time steps.
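The union behavior can be illustrated with pandas time axes (the date bounds below are hypothetical, not taken from any NA-CORDEX store):

```python
import pandas as pd

# Nominal daily time axes for two concatenated segments (hypothetical bounds).
hist = pd.date_range("1950-01-01", "2005-12-31", freq="D")
rcp85 = pd.date_range("2006-01-01", "2100-12-31", freq="D")

# The concatenated store's time axis is the union of the aligned axes,
# so nominal start_time/end_time span both segments.
combined = hist.union(rcp85)
start_time, end_time = combined[0], combined[-1]
```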

jeffdlb commented 3 years ago

Thanks for continuing to work on this.

Brian and I discussed how to structure the zarr catalog for NA-CORDEX and decided it would be best to catalog data granules at the logical level (one record per distinct data element), rather than the physical level (one record per zarr store). So it will have one row for each ensemble member in each zarr store.

When I first read this I thought you meant fewer rows, each referencing multiple Zarr stores, but upon re-reading I wonder whether you mean multiple rows pointing to segments of the same Zarr store. Which is it? I have concerns/questions about each approach...

Separate question:

bias_correction | "raw" for uncorrected data, "mbcn-Daymet" or "mbcn-gridMET" for bias-corrected

Does this concept apply to many/most model experiments or only to NA-CORDEX? It does not generally apply to observational data. Something like processing_level = {appropriate dataset-specific terminology} would perhaps be more generic. On the other hand:

driver | name of global climate model
rcm | name of regional climate model
source | concatenation of bias_correction, rcm, driver, & scenario; for comparison with other catalogs (e.g., LENS)

There is some redundancy. We should either use generic source & processing_level, or use a proliferation of specific columns like bias_correction & driver & rcm, but not both.

Path or paths plural?

I have not yet added start_time or end_time; we need to discuss whether to use nominal bounds or actual bounds (which would require a lot more work).

A separate discussion Tuesday about DASH Repository noted that some datasets have disjoint time intervals, which are difficult to accurately represent by a pair of start and end times, so a single time_range with one or more pairs of values would be better.

sethmcg commented 3 years ago

Hmm, some complicated issues that I think are going to be difficult to work out via comment thread. I think we need a real-time discussion. I'll try to schedule a meeting with me, Jeff, Anderson, and Brian. Let me know if there's anyone else I should try to rope in.

andersy005 commented 3 years ago

Let me know if there's anyone else I should try to rope in.

@mnlevy1981's input on this would be very useful since he (1) knows more about CESM data in their different forms, (2) has been trying to standardize columns for the CESM catalogs in https://github.com/NCAR/intake-esm-datastore/issues/64

mnlevy1981 commented 3 years ago

I can't make the meeting this afternoon, but will continue to follow along with this issue ticket and if there's a follow-up meeting I'll try to make it. I think the issue @andersy005 linked (https://github.com/NCAR/intake-esm-datastore/issues/64) has a complete list of the columns we expect to search by, though the initial comment is just the start and the list continues in the responses. We are happy to include other columns that are useful in other projects for the sake of consistency among catalogs.

sethmcg commented 3 years ago

Conclusions from the meeting: the primary purpose of these catalogs is to support dataset-oriented access and cross-dataset inventories. They're not the main first discovery step; users will likely have some idea of what's in the dataset by the time they arrive at the catalogs.

Although we want to avoid unlimited proliferation of columns, we need to allow for non-standard columns that capture important facets specific to the dataset. Standard columns should be validateable.

Updated intake-esm catalog structure for NA-CORDEX:

| std? | column | valid values | description / note |
| --- | --- | --- | --- |
| y | variable | see CORDEX variable table | standard short CORDEX variable name |
| y | long_name | see CORDEX variable table | controlled vocab |
| y | units | see CORDEX variable table | some datasets may be non-compliant |
| y | standard_name | see CORDEX variable table | CF standard name; doesn't fully distinguish variables (e.g., tasmax) |
| y | spatial_domain | CF standardized region name | always = north_america |
| y | grid | NAM-44i, NAM-22i (for NA-CORDEX) | 44i = 0.5 degree resolution, 22i = 0.25 degree; common lat-lon grid w/ cell boundaries at integer values |
| y | vertical_levels | integer > 0 | always = 1, since this is all 2-D data, no 3-D |
| y | frequency | day, fixed, (mon, seas, ann, ymon, yseas, etc.) | zarr stores only have daily & fixed (static) data |
| y | start_time | ISO-8601 datetime | start of time coordinates in data array (nominal; don't worry about ragged ends) |
| y | end_time | ISO-8601 datetime | end of time coordinates in data array (nominal; don't worry about ragged ends) |
| n | model | list of [RCM.GCM] strings | see [NA-CORDEX simulation matrix](https://na-cordex.org/simulation-matrix.html) for full set of pairings |
| n | scenario | eval, hist, rcp26, rcp45, rcp85 | GCM experiment or "eval" for ERA-Int runs; note rcp85, not rcp8.5 |
| n | bias_correction | raw, mbcn-Daymet, mbcn-gridMET | mbcn = method, Daymet/gridMET = obs dataset |
| y | path | [valid path] | absolute path to netcdf files on Glade |
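Since the standard columns are meant to be validateable, a lightweight check could be sketched like this. The vocabularies below are partial and assumed from the table; a real validator would load them from a maintained source:

```python
import pandas as pd

# Partial controlled vocabularies, assumed from the catalog table above.
CONTROLLED = {
    "frequency": {"day", "fixed", "mon", "seas", "ann", "ymon", "yseas"},
    "scenario": {"eval", "hist", "rcp26", "rcp45", "rcp85"},
    "bias_correction": {"raw", "mbcn-Daymet", "mbcn-gridMET"},
}

def validate_catalog(df):
    """Return {column: set of unexpected values} for columns with a vocab."""
    problems = {}
    for col, vocab in CONTROLLED.items():
        if col in df.columns:
            bad = set(df[col].unique()) - vocab
            if bad:
                problems[col] = bad
    return problems

# Example: "rcp8.5" is flagged because the vocabulary requires "rcp85".
problems = validate_catalog(pd.DataFrame({"scenario": ["hist", "rcp8.5"]}))
```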

Other notes:

jeffdlb commented 3 years ago

Thanks, Seth. Comments:

-Jeff

sethmcg commented 3 years ago

I added end_time. Thanks!

Thoughts on grid:

Maybe 'grid' is not the right name, but I think we want a column that captures spatial resolution, and that it should be present in every catalog, so that's why I would argue for making it a standard column.

If it's standard, it does need a controlled vocabulary, which may be hard to define up front, but GCMs have the grid descriptor strings like "T42" and "Tco199" and "C180" (I don't know the proper names for these) and CORDEX defines NAM-44i and EUR-22 and so on with a spec document, so my sense is that we could do it without being completely ad hoc; I think there's conventional nomenclature out there in the community we could leverage.

For obs data, I expect there's some equivalent set of conventions for swath and trajectory data, and for stationary observations we could just give it a value of "point" in the same way we use "frequency = fixed" for static data.

jeffdlb commented 3 years ago

Perhaps grid should be optional, and used only when the elements in the catalog need to be distinguished from each other by the choice of grid?

If used, I would prefer spatial_resolution in units of km or deg, but understand that is not always applicable so a controlled vocab may be needed.

Jeff


judithberner commented 3 years ago

Hi - we started a GitHub conversation about the next steps for zarrifying S2S data. The format is the SubX format, but we have local CESM output that follows the SubX output protocol. One of the issues is whether to use intake-esm. See issue here: https://github.com/bradyrx/climpred_CESM1_S2S/issues/1 I think we should use intake-esm if possible. Tagging: CESM team: @abjaye IRI team: @ikhomyakov, @awrobertson, @aaron-kaplan Climpred team: @aaronspring

jeffdlb commented 3 years ago

Hi Judith-

Yes, please do use intake-esm, including the extra columns that we are trying to standardize. I'm not sure what you mean by "SubX format" -- is that a convention applied to SubX data that is in a particular format such as CF/NetCDF?

-Jeff DLB



judithberner commented 3 years ago

SubX is written as netcdf output. The comment pertained to the fact that the S2S simulations with CESM output the variables required by the SubX protocol, on a grid required by SubX, and with the same ensemble size as that required for SubX. So any scripts we develop for the CESM S2S output should theoretically work just the same for the SubX data sets.

aaron-kaplan commented 3 years ago

Hi folks. At the invitation of @judithberner I just parachuted into this rather long discussion, and I'm not exactly sure what I'm doing here.

I work on the IRI Data Library. We host the SubX data, and we've been talking with Judith about making it available in S3 as Zarr.

This GitHub issue appears to be about an intake-esm catalog, but we were not planning on using intake-esm. We talked with Ryan Abernathey of Pangeo recently and he recommended that we use STAC for our catalog. We have no prior experience with either intake-esm or STAC, but I'm inclined to follow Ryan's advice until/unless we discover a reason not to. If you have reasons, please share them.

sethmcg commented 3 years ago

@jeffdlb The more I think about this issue, the trickier it gets.

I would like to use something straightforward like spatial_resolution, but I'm concerned that it could be seriously misleading. Three example problems that come to mind:

So my thinking was that we can't really get away from saying something about the nature of the grid in addition to its resolution or users will get confused when they run into one of these kinds of issues. And if the values came from a controlled vocabulary (or at least a somewhat standardized source), you might not know what something like "grid = T68" means when you run across it, but at least you can look it up and find out that Gaussian grids exist. But I agree that it would also be really useful to have something that gave you a rough sense of the resolution, even if it's not exact.

What if we required nominal_spatial_resolution in km or deg instead of spatial_resolution and also required or at least strongly recommended a spatial_grid entry coming from a controlled (ish) vocabulary? That seems like it would provide users with that rough sense while also signposting the fact that it's a complicated issue they should be prepared to need to investigate further.
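A string like "0.5 deg" or "25 km" is still machine-comparable if the format is pinned down. A minimal parsing sketch, assuming the "number space unit" format proposed above (the exact format is not yet specified):

```python
import re

def parse_resolution(value):
    """Parse a nominal-spatial-resolution string like '0.5 deg' or '25 km'
    into a (number, unit) pair. The format is an assumption, not a spec."""
    m = re.fullmatch(r"\s*([0-9.]+)\s*(deg|km)\s*", value)
    if m is None:
        raise ValueError(f"unrecognized resolution: {value!r}")
    return float(m.group(1)), m.group(2)
```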


sethmcg commented 3 years ago

Update after discussion on 2021-03-12: include nominal_spatial_resolution as a required column, add grid as an optional column.

| std? | column | valid values | description / note |
| --- | --- | --- | --- |
| y | variable | see CORDEX variable table | standard short CORDEX variable name |
| y | long_name | see CORDEX variable table | controlled vocab |
| y | units | see CORDEX variable table | some datasets may be non-compliant |
| y | standard_name | see CORDEX variable table | CF standard name; doesn't fully distinguish variables (e.g., tasmax) |
| y | spatial_domain | CF standardized region name | always = north_america |
| y | spatial_resolution | numeric with units | e.g. "0.5 deg" or "25 km" (nominal; use typical value for irregular grids) |
| y | vertical_levels | integer > 0 | always = 1, since this is all 2-D data, no 3-D |
| y | frequency | day, fixed, (mon, seas, ann, ymon, yseas, etc.) | zarr stores only have daily & fixed (static) data |
| y | start_time | ISO-8601 datetime | start of time coordinates in data array (nominal; may have ragged ends) |
| y | end_time | ISO-8601 datetime | end of time coordinates in data array (nominal; may have ragged ends) |
| n | grid | NAM-44i, NAM-22i (for NA-CORDEX) | as defined by the CORDEX Archive Specification |
| n | model | list of [RCM.GCM] strings | see [NA-CORDEX simulation matrix](https://na-cordex.org/simulation-matrix.html) for full set of pairings |
| n | scenario | eval, hist, rcp26, rcp45, rcp85 | GCM experiment or "eval" for ERA-Int runs; note rcp85, not rcp8.5 |
| n | bias_correction | raw, mbcn-Daymet, mbcn-gridMET | mbcn = method, Daymet/gridMET = obs dataset |
| y | path | [valid path] | absolute path to netcdf files on Glade or zarr stores on object store |

jeffdlb commented 3 years ago

Can we call it spatial_resolution, and just include the "nominal" aspect in the documentation, as we have done for start/end_time? I'd also suggest modifying the description of those to be "Nominal start [end] of time coordinates in data array (may have ragged ends when multiple simulations are combined)."

sethmcg commented 3 years ago

@jeffdlb Sounds good! Table above has been updated.