NCAR / cesm-lens-aws

Examples of analysis of CESM LENS data publicly available on Amazon S3 (us-west-2 region) using xarray and dask
https://doi.org/10.26024/wt24-5j82
BSD 3-Clause "New" or "Revised" License
43 stars 23 forks source link

Rant: bad approach to vertical coordinates #42

Closed jeffdlb closed 3 years ago

jeffdlb commented 4 years ago

In another thread, it was pointed out that the different vertical coordinates signify different locations in the level: z_t is the center, while z_w_top is the top of the level and z_w_bot is the bottom of the level. The response to the question "Should we somehow unify the dimension names for vertical levels, to simplify future user interaction with the data, or is it important to keep them distinct?" was "Keep them distinct, please." [Originally posted by @mnlevy1981 in https://github.com/NCAR/cesm-lens-aws/issues/34#issuecomment-596026185]

In my opinion this should be Considered Harmful on the part of modelers! In terms of future reusability of data, this sort of boutique nomenclature is little better than oral tradition handed down through the generations.

Each coordinate axis should always have the same name (e.g., X, Y, Z, T). Each variable data file that contains data along those axes should specify which coordinate reference system (CRS) is used for each axis for that particular file, using an agreed-upon set of standard CRS names and definitions.

The correct approach has long (>25 years ago) been developed by the Open Geospatial Consortium and others, and enshrined in common formats such as GeoTIFF. For example, the east-west coordinate X might be in units of degrees based on the Universal Transverse Mercator (UTM) projection, or of feet in Colorado State Plane, but it's always called "X" as the fundamental name in data files and in web API calls.

By analogy, we define "temperature" as a variable and separately state what units the temperature is in, rather than having temp_kelvin and temp_celsius and temp_reaumur depending on someone's particular preference. (I will avoid including in this particular rant my objection to every model naming variables in non-standard ways).

In my opinion, we should fix these types of problems in the data we publish in Zarr in the cloud. We are already converting to a new analysis-optimized format; leaving in these gratuitous non-interoperable differences is a bad idea -- it perpetuates instead of solving the problem, and these are new derived datasets with new DOIs for a broader audience and therefore should not be limited to the traditional way of doing things.

-Jeff DLB

mnlevy1981 commented 4 years ago

The problem is that there are variables (like temperature) that are computed the center of the cell, while other variables (such as the vertical velocity) that are computed at the top of the cell and still others (such as some associated with vertical mixing) that are computed at the bottom of the cell. So the vertical placement of kth level of TEMP, WVEL, and KVMIX are all at different depths for the same k. I'm more used to making this type argument in the horizontal, so let me shift gears just a little bit... but the following paragraph would also apply to the multiple vertical grids.

Given the fact that we are using an Arakawa B-grid for horizontal operators, what do you propose we call X and Y? The options seem to be "let X be the T-grid points and interpolate all variables on the U-grid to the T-grid", "let X be the U-grid points and interpolate all variables on the T-grid to the U-grid", "let 'X' be the combination of the two grids and insert missing values (or interpolate) where appropriate", or "store two separate horizontal grids." I'd argue strenuously against introducing numerical error by interpolating, and I think the disk space required to double the storage of basically all our variables prevents combining the two grids into one (note that treating the vertical in a similar manner would more than quadruple storage needed for 3D variables -- you'd be going from 60 x 384 x 320 to 121 x 768 x 640), so that leaves us with storing the T grid and U grid separately.

These aren't "gratuitous non-interoperable differences", this is the reality of looking at output produced on the staggered computational grid determined to be the best choice for the numerical methods employed in the model.

jeffdlb commented 4 years ago

@mnlevy1981 Thank you for replying. Yes, I understand that different variables are computed at grid cell centers vs edges, and I do not suggest we interpolate variables or double storage.

What I am suggesting is that in the output files (or rather, in our subsequent Zarr stores), we make those differences explicit in a different way. Rather than different axes z_t, z_w_top, and z_w_bot, there should be one axis Z that has a Coordinate Reference System attribute Z_crs which, for each particular variable, says in effect "this variable uses the cell-center vertical CRS" (or the cell-top or cell-bottom CRS). Z_crs should actually be the URL of a machine-readable definition.

It's not a change in how we compute, it's a change in the semantics of how we tell users what was computed and on what coordinate system.

mnlevy1981 commented 4 years ago

Oh, gotcha -- sorry for the misunderstanding. So basically, you like the horizontal approach of a single nlat and nlon, and then using the coordinate attribute to note whether it's T-grid or U-grid:

        float TEMP(time, z_t, nlat, nlon) ;
                TEMP:long_name = "Potential Temperature" ;
                TEMP:units = "degC" ;
                TEMP:coordinates = "TLONG TLAT z_t time" ;
        float UVEL(time, z_t, nlat, nlon) ;
                UVEL:long_name = "Velocity in grid-x direction" ;
                UVEL:units = "centimeter/s" ;
                UVEL:coordinates = "ULONG ULAT z_t time" ;

and are wondering why we can't do the same in z

        float TEMP(time, nz, nlat, nlon) ;
                TEMP:long_name = "Potential Temperature" ;
                TEMP:units = "degC" ;
                TEMP:coordinates = "TLONG TLAT z_t time" ;
        float WVEL(time, nz, nlat, nlon) ;
                WVEL:long_name = "Vertical Velocity" ;
                WVEL:units = "centimeter/s" ;
                WVEL:coordinates = "TLONG TLAT z_w_top time" ;
        float KVMIX(time, nz, nlat, nlon) ;
                KVMIX:long_name = "Vertical diabatic diffusivity due to Tidal Mixing + background" ;
                KVMIX:units = "centimeter^2/s" ;
                KVMIX:coordinates = "TLONG TLAT z_w_bot time" ;

That's a reasonable approach, though given the fact that most of the CESM ocean software engineering is focused on the transition from POP to MOM I don't think it would happen at the model level. So the big question (that I don't have an answer to) is "is it okay if the cloud data differs from the model-generated data available on glade in this manner."

jeffdlb commented 4 years ago

Yes, except that (a) the units attribute would use a controlled vocabulary such as UDUNITS and (b) the coordinates attribute(s) would be URL(s) of machine-readable definitions for each coordinate instead of a human-readable string that lists multiple coordinates and requires prior knowledge or that users RTFM.

rabernat commented 4 years ago

The CF conventions define the axis attribute explicitly to address this: http://cfconventions.org/Data/cf-conventions/cf-conventions-1.8/cf-conventions.html#coordinate-types

See, for example, their 2D lat lon coordinates example:

dimensions:
  xc = 128 ;
  yc = 64 ;
  lev = 18 ;
variables:
  float T(lev,yc,xc) ;
    T:long_name = "temperature" ;
    T:units = "K" ;
    T:coordinates = "lon lat" ;
  float xc(xc) ;
    xc:axis = "X" ;
    xc:long_name = "x-coordinate in Cartesian system" ;
    xc:units = "m" ;
  float yc(yc) ;
    yc:axis = "Y" ;
    yc:long_name = "y-coordinate in Cartesian system" ;
    yc:units = "m" ;
  float lev(lev) ;
    lev:long_name = "pressure level" ;
    lev:units = "hPa" ;
  float lon(yc,xc) ;
    lon:long_name = "longitude" ;
    lon:units = "degrees_east" ;
  float lat(yc,xc) ;
    lat:long_name = "latitude" ;
    lat:units = "degrees_north" ;

This is the right way to do it. This approach can easily be extended to the Z axis, with multiple different dimension variables (e.g. z_w, z_t) all sharing the same axis = "Z" attribute. Many tools, e.g. xgcm, can parse these conventions.

So POP output is not following CF conventions.

I believe the best path forward for convincing modeling groups to modify their output is to cite the CF conventions document. If we think the CF conventions need some changes, there is also a process for doing that.