Open znichollscr opened 5 months ago
@durack1 I had a look at the CF-conventions. Assuming that we generally try to follow those (maybe a big if):
As far as I can tell, for the global- and hemispheric-means, the CF-compliant way to do this would be to designate these timeseries as discrete sampling geometries of type timeSeries and then use a bounds variable or something to indicate the extent of the geographical region they cover.
There are standardised regions in CF-conventions, see https://cfconventions.org/Data/standardized-region-list/standardized-region-list.current.html. However, they explicitly say that regions like hemispheres should use coordinate ranges instead.
Do you know anyone who works on CF-conventions that we could ask about this? It looks like there is an answer, I just can't work out where exactly to look to find it/decode the docs properly.
@znichollscr - I guess you are referring to https://cfconventions.org/cf-conventions/cf-conventions.html#geographic-regions and section 9, https://cfconventions.org/cf-conventions/cf-conventions.html#discrete-sampling-geometries. This paragraph suggests how to handle the hemispheric means; it's not optimal, but it's what they have for now. "Future versions of CF will generalize the concepts of geolocation to encompass these cases. As of CF version 1.6 such data can be stored using the representations that are documented here by two means: 1) by utilizing the orthogonal multidimensional array representation and omitting the featureType attribute; or 2) by assigning arbitrary coordinates to the mandatory dimensions. For example a globally-averaged latitude position (90s to 90n) could be represented arbitrarily (and poorly) as a latitude position at the equator."
"This paragraph suggests how to handle the hemispheric means; it's not optimal, but it's what they have for now"
Yes, that was what I was thinking of. The issue is that it tells me what to put in the lat box (just pick a lat, it doesn't matter which). It doesn't tell me how to capture the region over which the mean was taken. I can't use lat bounds because the bounds are different for the different timeseries. This stuff about timeSeries seems to be the right way, I just can't find an example I can follow. Specifically, I can't tell what metadata/co-ordinates to use to exactly specify the area over which each timeseries was calculated. Maybe it's meant to be something like "mean: lat where lat > 0", "mean: lat where lat < 0" as part of the timeseries' metadata.
@znichollscr there are several cases where "sector" is used across the forcing datasets. In most cases these were to collapse separate sector contributions (in the case of emissions) into a single variable, so as to reduce the total variable counts that contained different numbers on the same grid/coordinates. In the example below, it's not regions that are represented, like what you are noting above, but rather contributions from agriculture, energy, etc. Such logic could be reused across geographical regions, but we'd have to think about this, and it would likely only make sense if these were integrated quantities:
lat:realtopology = "linear" ;
lat:standard_name = "latitude" ;
int sector(sector) ;
sector:long_name = "sector" ;
sector:bounds = "sector_bnds" ;
sector:ids = "0: Agriculture; 1: Energy; 2: Industrial; 3: Transportation; 4: Residential, Commercial, Other; 5: Solvents production and application; 6: Waste; 7: International Shipping" ;
double time(time) ;
time:units = "days since 1750-01-01 0:0:0" ;
time:long_name = "time" ;
time:calendar = "365_day" ;
time:axis = "T" ;
time:bounds = "time_bnds" ;
time:realtopology = "linear" ;
time:standard_name = "time" ;
float SO2_em_anthro(time, sector, lat, lon) ;
SO2_em_anthro:units = "kg m-2 s-1" ;
SO2_em_anthro:_FillValue = 1.e+20f ;
SO2_em_anthro:long_name = "SO2 Anthropogenic Emissions" ;
SO2_em_anthro:cell_methods = "time: mean" ;
SO2_em_anthro:missing_value = 1.e+20f ;
double lat_bnds(lat, bound) ;
double lon_bnds(lon, bound) ;
double time_bnds(time, bound) ;
double sector_bnds(sector, bound) ;
// global attributes:
:Conventions = "CF-1.6" ;
:activity_id = "input4MIPs" ;
:comment = "This data supersedes 2016-06-18, 2016-06-18-sectorDimV2, 2016-07-26, and 2016-07-26-sectorDim data versions. See README file at the project web site." ;
:contact = "Steven J. Smith (ssmith@*gov)" ;
:creation_date = "2017-05-19T06:35:19Z" ;
:data_structure = "grid" ;
:dataset_category = "emissions" ;
:dataset_version_number = "2017-05-18" ;
:external_variables = "gridcell_area" ;
:frequency = "mon" ;
:further_info_url = "http://www.globalchange.umd.edu/ceds/" ;
:grid = "0.5x0.5 degree latitude x longitude" ;
:grid_label = "gn" ;
:history = "19-05-2017 06:35:19 AM UTC; College Park, MD, USA" ;
:institution_id = "PNNL-JGCRI" ;
:mip_era = "CMIP6" ;
:product = "primary-emissions-data" ;
:realm = "atmos" ;
:references = "Hoesly, R. M., Smith, S. J., Feng, L., Klimont, Z., Janssens-Maenhout, G., Pitkanen, T., Seibert, J. J., Vu, L., Andres, R. J., Bolt, R. M., Bond, T. C., Dawidowski, L., Kholod, N., Kurokawa, J.-I., Li, M., Liu, L., Lu, Z., Moura, M. C. P., O\'Rourke, P. R., and Zhang, Q.: Historical (1750-2014) anthropogenic emissions of reactive gases and aerosols from the Community Emission Data System (CEDS), Geosci. Model Dev. Discuss., doi:10.5194/gmd-2017-43, in review, 2017." ;
:source = "CEDS-2017-05-18: Community Emissions Data System (CEDS) for Historical Emissions" ;
:table_id = "input4MIPs" ;
:target_mip = "CMIP" ;
:title = "Annual Anthropogenic Emissions of SO2 prepared for input4MIPs" ;
:variable_id = "SO2_em_anthro" ;
:global_total_emission_1750 = "0.48 Tg/year" ;
:global_total_emission_1799 = "0.69 Tg/year" ;
:data_usage_tips = "Note that these are monthly average fluxes." ;
:reporting_unit = "Mass flux of SOx, reported as SO2" ;
:nominal_resolution = "50 km" ;
:institution = "Pacific Northwest National Laboratory - Joint Global Change Research Institute, College Park, MD 20740, USA" ;
:source_id = "CEDS-2017-05-18" ;
:tracking_id = "hdl:21.14100/c25fca8d-7edb-41cf-9e76-6b7430c04a05" ;
}
file: input4MIPs/CMIP6/CMIP/PNNL-JGCRI/CEDS-2017-05-18/atmos/mon/SO2-em-anthro/gn/v20170519/SO2-em-anthro_input4MIPs_emissions_CMIP_CEDS-2017-05-18_gn_175001-179912.nc
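For anyone reading the header above: the sector lookup lives in the sector:ids attribute as a single string, so each user has to parse it themselves. A minimal sketch of that parsing, assuming the "index: name; index: name" layout shown above also holds for other files (that is an assumption, and parse_sector_ids is a hypothetical helper, not part of any library):

```python
def parse_sector_ids(ids_attr: str) -> dict:
    """Turn an 'index: name; index: name' attribute string into a lookup."""
    lookup = {}
    for entry in ids_attr.split(";"):
        entry = entry.strip()
        if not entry:
            continue
        # Split on the first colon only, so names may contain commas.
        index, name = entry.split(":", 1)
        lookup[int(index)] = name.strip()
    return lookup


# Value of sector:ids copied from the header above.
SECTOR_IDS = (
    "0: Agriculture; 1: Energy; 2: Industrial; 3: Transportation; "
    "4: Residential, Commercial, Other; "
    "5: Solvents production and application; 6: Waste; "
    "7: International Shipping"
)

sectors = parse_sector_ids(SECTOR_IDS)
for index, name in sorted(sectors.items()):
    print(index, name)
```

Note that the separator is ";" rather than ",", since one of the sector names itself contains commas.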
"there are several cases where 'sector' is used across the forcing datasets. In most cases these were to collapse separate sector contributions (in the case of emissions) into a single variable, so as to reduce the total variable counts that contained different numbers on the same grid/coordinates"
Yep that makes total sense (because in the case of emissions, they literally are sectors).
Put another way, two questions:
@znichollscr there's no real "have to", rather the intention through iteration in CMIP6 was to simplify data as much as possible and bring it toward the CMIP single-variable-per-file layout, meeting CF metadata conventions and containing enough metadata in the global attributes to identify these data.
RE: 2, CF conventions, @taylor13 is definitely a full bottle on these, and is actively part of ongoing CF specification discussions - we also have a number of others across the WIP, but Karl would be my first stop
Thanks. So @taylor13, just trying to summarise the above so you don't have to read us going round in circles:
We have a file with 3 time series: global-mean, northern-hemisphere mean, southern-hemisphere mean. What is the CF-compliant way to capture this?
@znichollscr just answering after consulting with @taylor13. We have examples in the CMIP6 archive, with the following below:
(base) -bash-4.2$ ncdump -h /p/css03/esgf_publish/CMIP6/CMIP/CNRM-CERFACS/CNRM-CM6-1/historical/r20i1p1f2/Omon/hfbasin/gn/v20191004/hfbasin_Omon_CNRM-CM6-1_historical_r20i1p1f2_gn_185001-201412.nc
netcdf hfbasin_Omon_CNRM-CM6-1_historical_r20i1p1f2_gn_185001-201412 {
dimensions:
j-mean = 294 ;
basin = 3 ;
str_len = 255 ;
time = UNLIMITED ; // (1980 currently)
axis_nbounds = 2 ;
variables:
float j-mean(j-mean) ;
j-mean:name = "j-mean" ;
j-mean:long_name = "Ocean grid longitude mean" ;
j-mean:units = "-" ;
char sector(basin, str_len) ;
sector:name = "sector" ;
sector:standard_name = "region" ;
sector:long_name = "ocean basin" ;
sector:units = "1" ;
double time(time) ;
time:axis = "T" ;
time:standard_name = "time" ;
time:long_name = "Time axis" ;
time:calendar = "gregorian" ;
time:units = "days since 1850-01-01 00:00:00" ;
time:time_origin = "1850-01-01 00:00:00" ;
time:bounds = "time_bounds" ;
double time_bounds(time, axis_nbounds) ;
float hfbasin(time, basin, j-mean) ;
hfbasin:long_name = "Northward Ocean Heat Transport" ;
hfbasin:units = "W" ;
hfbasin:online_operation = "average" ;
hfbasin:cell_methods = "longitude: mean (basin) time: mean" ;
hfbasin:interval_operation = "1800 s" ;
hfbasin:interval_write = "1 month" ;
hfbasin:_FillValue = 1.e+20f ;
hfbasin:missing_value = 1.e+20f ;
hfbasin:coordinates = "sector" ;
hfbasin:comment = "This variable has an axis labelled j-mean, while CMIP6 calls for an axis labelled latitude. We want here to pinpoint that we provide values which are averaged over the X-axis of our tripolar grid, along which latitude do vary. This axis begins South.Please refer to the lat/lon coordinate variables in this file for further details." ;
hfbasin:standard_name = "northward_ocean_heat_transport" ;
hfbasin:description = "Contains contributions from all physical processes affecting the northward heat transport, including resolved advection, parameterized advection, lateral diffusion, etc. Diagnosed here as a function of latitude and basin. Use Celsius for temperature scale." ;
hfbasin:history = "none" ;
// global attributes:
...
Which then has the lookup basin coordinate defined below:
(base) -bash-4.2$ ncdump -h /p/css03/esgf_publish/CMIP6/CMIP/CNRM-CERFACS/CNRM-CM6-1/historical/r20i1p1f2/Ofx/basin/gn/v20191004/basin_Ofx_CNRM-CM6-1_historical_r20i1p1f2_gn.nc
netcdf basin_Ofx_CNRM-CM6-1_historical_r20i1p1f2_gn {
dimensions:
axis_nbounds = 2 ;
x = 362 ;
y = 294 ;
nvertex = 4 ;
time = UNLIMITED ; // (0 currently)
variables:
double lat(y, x) ;
lat:standard_name = "latitude" ;
lat:long_name = "Latitude" ;
lat:units = "degrees_north" ;
lat:bounds = "bounds_lat" ;
double lon(y, x) ;
lon:standard_name = "longitude" ;
lon:long_name = "Longitude" ;
lon:units = "degrees_east" ;
lon:bounds = "bounds_lon" ;
double bounds_lon(y, x, nvertex) ;
double bounds_lat(y, x, nvertex) ;
short basin(y, x) ;
basin:standard_name = "region" ;
basin:long_name = "Region Selection Index" ;
basin:units = "1" ;
basin:online_operation = "once" ;
basin:cell_methods = "area: mean" ;
basin:_FillValue = 0s ;
basin:missing_value = 0s ;
basin:coordinates = "lat lon" ;
basin:description = "Region Selection Index" ;
basin:history = "none" ;
basin:cell_measures = "area: areacello" ;
basin:flag_meanings = "global_land southern_ocean atlantic_ocean pacific_ocean arctic_ocean indian_ocean mediterranean_sea black_sea hudson_bay baltic_sea red_sea" ;
basin:flag_values = "0 1 2 3 4 5 6 7 8 9 10" ;
// global attributes:
...
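To make the lookup in the basin header above concrete, here is a minimal sketch of pairing flag_values with flag_meanings. It assumes both attributes arrive as space-separated strings, as they appear in this header; flag_lookup is a hypothetical helper, not a library function:

```python
def flag_lookup(flag_values: str, flag_meanings: str) -> dict:
    """Pair each integer flag value with its meaning, in order."""
    values = [int(v) for v in flag_values.split()]
    meanings = flag_meanings.split()
    if len(values) != len(meanings):
        raise ValueError("flag_values and flag_meanings differ in length")
    return dict(zip(values, meanings))


# Attribute values copied from the basin header above.
FLAG_MEANINGS = (
    "global_land southern_ocean atlantic_ocean pacific_ocean arctic_ocean "
    "indian_ocean mediterranean_sea black_sea hudson_bay baltic_sea red_sea"
)
FLAG_VALUES = "0 1 2 3 4 5 6 7 8 9 10"

basins = flag_lookup(FLAG_VALUES, FLAG_MEANINGS)
for value, meaning in basins.items():
    print(value, meaning)
```

The meanings here are names from the standardized region list, which is why this pattern does not obviously extend to hemispheres.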
"@znichollscr just answering after consulting with @taylor13. We have examples in the CMIP6 archive, with the following below"
Thanks. The file linked uses standard regions, following the names here https://cfconventions.org/Data/standardized-region-list/standardized-region-list.current.html.
However, that same page also has the advice below:
"We excluded: (a) ... (b) regions that could be specified by coordinate ranges in CF (e.g. western hemisphere); ..."
So it seems to me like this is not the recommended way to do this for hemispheric means. The fact that there are no regions like "northern_hemisphere" or "southern_hemisphere" suggests that doing it this way for hemispheric means is not even supported.
Is the advice at https://cfconventions.org/Data/standardized-region-list/standardized-region-list.current.html incorrect? Or is there another recommended way to do this?
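One possible reading of the "coordinate ranges" advice, sketched here under assumptions rather than as a confirmed CF recipe: give each timeseries its own latitude range, for example via a latitude coordinate whose bounds vary along the region dimension. The region labels and the REGION_LAT_BOUNDS table below are hypothetical and not taken from CF or from any file in this thread. The sketch also computes the fraction of the sphere each range covers, which follows from the sines of the bounding latitudes:

```python
import math

# Hypothetical labels and latitude ranges in degrees north (assumptions,
# not names from the CF standardized region list).
REGION_LAT_BOUNDS = {
    "global": (-90.0, 90.0),
    "northern_hemisphere": (0.0, 90.0),
    "southern_hemisphere": (-90.0, 0.0),
}


def area_fraction(lat_south: float, lat_north: float) -> float:
    """Fraction of a sphere's surface between two latitudes, all longitudes."""
    south = math.sin(math.radians(lat_south))
    north = math.sin(math.radians(lat_north))
    return (north - south) / 2.0


fractions = {
    name: area_fraction(*bounds) for name, bounds in REGION_LAT_BOUNDS.items()
}
for name, fraction in fractions.items():
    print(name, REGION_LAT_BOUNDS[name], fraction)
```

Under these assumptions the global range covers the whole sphere and each hemisphere covers half of it, so the bounds alone are enough to say which area each mean was taken over.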
There's a little bit of duplication in discussions across #29 and here
Yes, I'll figure it out
Feedback from @vnaik60:
Yep very good points that will help clarify things.
@durack1, my feeling has always been that using 'sector' to denote different areas over which the average was taken is a hack. In particular, it means that every user of the data has to write their own custom script to parse the sectors and then apply them sensibly to the data. Is there a reason it is like this, rather than having a dimension called e.g. 'averaging_region' with the values 'gm', 'nh' and 'sh'? To me, this seems like it would be much simpler for everyone.
If there is some legacy reason that we have to use sector, I'll just make the changes that Vaishali suggested.