Handle different grid coordinate formats and naming

JiaweiZhuang commented 4 years ago

This is a meta-issue summarizing those frequently-asked questions:

Coordinate naming: #5 #13 #38 #73
Boundary formatting: https://github.com/JiaweiZhuang/xESMF/issues/14#issuecomment-369686779, https://github.com/JiaweiZhuang/xESMF/issues/32#issuecomment-418909576
Upstream: pydata/xarray#475

The problem

xESMF requires the input grid objects to contain variableslon/lat of shape (n_lat, n_lon), and optionally lon_b/lat_b of shape (n_lat+1, n_lon+1) for conservative regridding.

This leads to the naming problem that the original name might be latitude, 'lat_bnds', 'latitude_bnds', and the boundary formatting problem that the original boundary array might have shape (n_lat, n_lon, 4) instead of (n_lat+1, n_lon+1). The current fix is to rename the coordinate (#5) and reformatting the boundary (https://github.com/JiaweiZhuang/xESMF/issues/14#issuecomment-369686779)

The two problems often occur together: the CF-convention uses the name lat_bnds with a shape of (n_lat, n_lon, 4).

An upstream cause is that an xarray.Dataset has no notion of cell boundaries; other packages like xgcm tries to workaround that (#13). Xarray also does not force CF-convention.

Desired features for the solution

1. Unambiguous

xesmf should always be very clear and strict on the expected grid format. This prevents tricky edge cases and user confusion. What if the input dataset/dictionary contains all the three variables lat_b, 'latitude_b', 'latitude_bnd'? Which one is picked? Or should it throw a "duplication error"? (https://github.com/JiaweiZhuang/xESMF/pull/38#issuecomment-431536946)

There can be options to choose one of the names, but at a time there should only be one valid name, which can be printed out by something like xesmf.get_config().

The expected boundary array format also has to be explicit. There can be extra, preprocessing function to reformat the boundary, but such option has to be set explicitly so that users are aware of what they are doing.

2. Simple

This is a simple problem and needs a simple, intuitive solution. Although being annoying, such issue does not affect xesmf's core functionality (the regridding computation). I am hesitate to put complex preprocessing logic or clever heuristics to guess the name and tweak the grid format. Complex code adds maintenance cost, and causes confusion for users who do not need such feature.

An alternative, simple "fix" to this issue, is to add some example notebooks showing how to preprocess various different grid formats, without complicating the package source code.

If changes are made in source code, there won't be too many lines of new code and many if/else switches - those are an indication of complex logic.

Proposed solutions

1. Allow custom coordinate naming, by implementing xesmf.config.set(grid_name_dict=...) as global config or context manager.

(Originally proposed at https://github.com/JiaweiZhuang/xESMF/pull/38#issuecomment-432028269)

The grid_name_dict exactly follows xarray.Dataset.rename(name_dict). xesmf.config.set works similarly as xarray.set_options or dask.config.get. There should also be a xesmf.config.get('grid_name_dict') to print the current expected grid name.

Example usage:

import xesmf as xe
xe.config.get('grid_name_dict')  # prints {'lon': 'lon', 'lat': 'lat', 'lon_b': 'lon_b', 'lat': 'lat_b'}, which is the default

grid_name_dict = {'lon': 'longitude', 'lat': 'latitude', 
                  'lon_b': 'longitude_bnds', 'lat_b': 'latitude_bnds'}

with xe.config.set(grid_name_dict=grid_name_dict):  # set it regionally
    xe.Regridder(...)  # automatically use user-specified `grid_name_dict`
    xe.config.get('grid_name_dict')  # prints the modified config

xe.config.set(grid_name_dict=grid_name_dict)  # or set it globally
xe.Regridder(...)  # always uses the new global config

xe.config.refresh()  # back to default, similar to dask.config.refresh()

xesmf.config.set might also be used to set other general configurations, although I haven't thought of an example. If there're no other configurable parameters, can also just implement a single-purpose function xesmf.set_grid_name() / xesmf.get_grid_name().

2. Implement utility functions for reformatting cell boundaries

Change (n_lat, n_lon, 4) to (n_lat+1, n_lon+1), similar to OCGIS https://github.com/JiaweiZhuang/xESMF/issues/32#issuecomment-419496538

grid_bounds = xesmf.util.convert_corners(grid_bounds_with_4_corners)

Another very useful util is inferring boundaries from centers https://github.com/JiaweiZhuang/xESMF/issues/13#issuecomment-416947178:

grid_with_bounds = xesmf.util.infer_bounds(grid_without_bounds)

Optionally, those functions can be wrapped in the high-level API like xe.Regridder(..., boundary_format='4_corners') or xe.Regridder(..., boundary_format='inferred').

3. Simple support for CF-convention, built on step 1 and 2

Given the popularity of CF-convention, it makes sense to support such input data out-of-box (#38 #73). I emphasize "simple" because xesmf has no reason to check all CF-compliant attributes such as unit = 'degrees_east' or standard_name = 'latitude' -- this is not the task for a regridding package.

For coordinate naming, can just set xesmf.config.set(grid_name_dict=xe.config.cf_grid_name), where

xe.config.cf_grid_name = {
    'lon': 'longitude', 'lat': 'latitude', 
    'lon_b': 'longitude_bnds', 'lat'_b: 'latitude_bnds'}

is pre-defined for convenience. Can also add more pre-defined dictionaries for other names like latitude_bounds, lat_bnds, or simply let users set their own.

The boundary formatting should explicitly go through step 2. Handling the boundary decoding automatically can often lead to corner cases and errors. For example what if the input grid is a 4-tile grid of shape (n_lat, n_lon, 4), but gets mis-interpreted as 4 corners? Should the Regridder throw an error when seeing 3-D grids, or check whether it is another representation?

Any comments & suggestions? PRs are particularly welcome, especially on the 'grid_name_dict' part.

huard commented 4 years ago

I support the idea of using the CF-Convention as the standard xESMF is designed around. To me this would imply that

whatever variable has long_name=latitude is the latitude variable,
latitude boundaries are identified by the bounds attribute of the latitude variable.

Using this approach, there's no need to rename variables before and after regridding.

However, I can't find mentions of the definition of lat_b, lon_b as used by xESMF in the CF-convention. If we follow the CF as the default logic, the default behavior would be for xESMF to use the conventioned lat_bnds, lon_bnds and convert them to the lat_b, lon_b required by ESMF.

However, this would break backward compatibility, and does not provide a mechanism to pass pre-computed bounds. A solution to this would be to have a var_names dict argument with keys lat, lon, blat, blon that explicitly identifies the variables, and overrides the CF approach when provided. This dictionary can be set by default to current values with a deprecation warning, then set to None in a later release.

jthielen commented 4 years ago

I also agree that supporting CF conventions would be very useful, and would allow me to remove a quite a few kludges in my code when using xESMF.

With CF convention support in mind, there are many scattered efforts across the Pangeo ecosystem to interpret CF conventions to permit more intuitive APIs and cleaner code. For example:

MetPy, which I've worked on quite a bit
geoxarray
xgcm

Would this be something where a common cf-xarray package would be useful? It could provide utilities for working with the high-level components covered by the conventions:

data description (units [perhaps by way of pint-xarray], standard names)
coordinate types (e.g., X, Y, Z, T; useful to xESMF in recognizing lats and lons)
coordinate systems (CRSs, like geoxarray is focused on)
cell representations (like what xgcm and xESMF would appear to need)
discrete geometries (e.g., https://github.com/Unidata/MetPy/pull/1074)

Tagging some people I've heard bring up an idea like this before:

@dcherian, since I recall seeing a comment of yours on Gitter about a cf-xarray type package to support APIs like .sum(dim="X") to sum a variable over its "X" dimension
@djhoese, since geoxarray has had some overlapping goals with what a cf-xarray package would be
@rabernat, for an xgcm and Pangeo perspective

djhoese commented 4 years ago

Thanks @jthielen. Another person I think should be kept in the loop in this discussion is @snowman2 who maintains pyproj and rioxarray. I bring these projects up because:

pyproj has the ability to convert CRS information to/from CF definitions. I recently switched some of my projects to depending on it for this.
rioxarray is a good example of using xarray accessors to deal with various headaches that this project may run in to. For example, defining which dimensions represent which geographic dimension ("x" and "y" versus "lon" and "lat" or any other odd naming that may exist in the wild).

The combination of these two projects has resulted in something that I've thought is much better than what I was attempting in geoxarray. @snowman2 creates a "spatial_ref" coordinate variable which itself has a crs_wkt attribute (the WKT version of the Coordinate Reference System). I think you can then also copy other coordinate variables like x/y to this spatial_ref variable. This has the benefit of holding on to CRS information and making it easy to access where xarray may have dropped .attrs or coordinate variables in other implementations.

Edit: I should have mentioned that I'd like geoxarray into something similar to rioxarray but not rasterio specific (no rasterio/gdal dependency).

snowman2 commented 4 years ago

@djhoese thanks for the mention, hopefully I can provide something useful to the conversation.

I am hesitate to put complex preprocessing logic or clever heuristics to guess the name and tweak the grid format. Complex code adds maintenance cost, and causes confusion for users who do not need such feature.

An alternative, simple "fix" to this issue, is to add some example notebooks showing how to preprocess various different grid formats, without complicating the package source code.

rioxarray looks for latitude, longitude, x and y and then stops code link. Too many possible names to look for and as it was mentioned in this issue, the logic would get pretty messy to try to auto-guess as it could be anything.

if the dimension is not set, then rioxarray provides instructions to the user for how to handle it in an exception. code link

huard commented 4 years ago

I agree something like cf-xarray would be useful. I'm concerned however that diving into this topic here could distract from the issue at hand.

Is there a place where all the potential use cases for cf-xarray could be listed, with links to existing code ? I think having this in one place would be helpful in designing an API for cf-xarray.

dcherian commented 4 years ago

discourse.pangeo.io?

rabernat commented 4 years ago

My $0.02: cf-xarray is definitely the right way to go. I envision a lightweight library for parsing cf metadata that all these packages can depend on.

Given the technical nature of this discussion, maybe we can keep it on github. An issue on https://github.com/pangeo-data/pangeo would be appropriate.

jthielen commented 4 years ago

I agree something like cf-xarray would be useful. I'm concerned however that diving into this topic here could distract from the issue at hand.

Is there a place where all the potential use cases for cf-xarray could be listed, with links to existing code ? I think having this in one place would be helpful in designing an API for cf-xarray.

discourse.pangeo.io?

My $0.02: cf-xarray is definitely the right way to go. I envision a lightweight library for parsing cf metadata that all these packages can depend on.

Given the technical nature of this discussion, maybe we can keep it on github. An issue on https://github.com/pangeo-data/pangeo would be appropriate.

Sounds good. I can plan on writing something up today to get that discussion started over there (unless someone beats me to it!), and then leave this issue alone w.r.t. cf-xarray until it would be ready to use.

Update: see https://github.com/pangeo-data/pangeo/issues/771

stefraynaud commented 4 years ago

So, has it been decided to add the ability to find geographic coordinates? Or should xesmf be prevented from inspecting the datasets?

If no inspection should be performed, a light weight version of #94 is appropriate, otherwise a mix between #73 and #94 is probably the most robust solution.

huard commented 4 years ago

My understanding is that there is a will to "support the CF-convention". To me, this means that if a file complies with CF, xESMF should be able to identify the grid centers and/or corners automatically.

I second your suggestion to mix #73 and #94.I'll review your PR.

JiaweiZhuang / xESMF