JiaweiZhuang / xESMF

Universal Regridder for Geospatial Data
http://xesmf.readthedocs.io/
MIT License

Adding 'latitude' and 'longitude' to acceptable coordinate names #38

Closed bbakernoaa closed 2 years ago

bbakernoaa commented 5 years ago

This adds 'latitude' and 'longitude' to the acceptable coordinate names, for compliance with the netCDF CF 1.6 conventions. It adds a new method called get_latlon_names that checks whether the xr.DataArray contains 'lat' or 'latitude', or otherwise follows the netCDF COARDS or netCDF CF convention.

spencerahill commented 5 years ago

I like this. I'm wondering if it would be within reason to generalize this even further, to allow the user to specify their own lat and lon coordinate names, e.g. if their data uses 'x' and 'y' or 'LONGITUDES' and 'LATITUDES' or any other variants.

JiaweiZhuang commented 5 years ago

Thanks for the PR, that's a good point...

We need to think carefully about this, because @spencerahill's comment can be further generalized to

'y' seems a bad idea because the coordinate value is latitude, but other options all seem reasonable. But too many aliases will cause confusion. Also, what if a grid object contains multiple valid variable names?

Even worse for boundary variables:

You name it. I am happy to discuss how to handle this potential chaos...

JiaweiZhuang commented 5 years ago

One way is to have a signature like

Regridder(grid_in, grid_out, ..., lon_name = 'lon', lat_name = 'lat')

where lon_name can be overwritten by users. But is this really more convenient than simply renaming the variable?

jhamman commented 5 years ago

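It might be worthwhile to take a step back and look at what functionality other popular regridding tools offer in this area. The two I'm most familiar with are NCO and CDO, both of which are command-line tools. My understanding is that both of these tools default to using coordinate information in the dataset (netCDF file). NCO seems to be the most flexible and works like this:

  • if the grid dataset has variables with coordinates attributes, these are used to define the grid
  • if no variables have the coordinates attribute, then some basic heuristics are used to determine where to find the coordinate information
  • finally, these can all be overridden with command line options (e.g. cremap -R "--rgr lat_nm=xq --rgr lon_nm=zj" -d dst.nc -O ~/rgr in.nc # Manual)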

So xarray provides a lot of the necessary metadata to make the first two steps possible. One suggestion would be to first look at the grid variables and see if we can determine their coordinate variables; next, look for common names; finally, if we can't find a coordinate variable, raise an error. Of course, the lon_name and lat_name arguments should be optional, probably defaulting to None.
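A minimal sketch of that lookup order, under some loud assumptions: the dataset is modelled as a plain dict mapping each variable name to its attribute dict (no xarray involved), and `resolve_coord_name`, `aliases` and `override` are hypothetical names, not part of xESMF. Step 1 here only trusts names that some variable lists in its `coordinates` attribute and that also match an alias; a real implementation might decide differently.

```python
from typing import Mapping, Optional, Sequence


def resolve_coord_name(
    variables: Mapping[str, Mapping[str, str]],
    aliases: Sequence[str],
    override: Optional[str] = None,
) -> str:
    """Pick one coordinate name from `variables` (name -> attribute dict)."""
    # An explicit user choice (e.g. lon_name='xq') wins over any detection.
    if override is not None:
        if override not in variables:
            raise ValueError(f"{override!r} is not a variable in the dataset")
        return override

    # Step 1: prefer names that some variable lists in its `coordinates` attribute.
    referenced = [
        name
        for attrs in variables.values()
        for name in attrs.get("coordinates", "").split()
    ]
    for name in referenced:
        if name in aliases and name in variables:
            return name

    # Step 2: fall back to common names, in the order given.
    for alias in aliases:
        if alias in variables:
            return alias

    # Step 3: nothing matched, so fail loudly instead of guessing.
    raise ValueError(f"no coordinate found among {list(aliases)}")


grid = {
    "temp": {"coordinates": "longitude latitude"},
    "longitude": {},
    "latitude": {},
    "lon": {},  # present, but not referenced by any `coordinates` attribute
}
lon_name = resolve_coord_name(grid, aliases=("lon", "longitude"))
```

With `override=None` as the default, users who are happy with automatic detection never have to pass a name, while the override still covers datasets with unusual names.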

bbakernoaa commented 5 years ago

I suggest that we support a limited number of them by default, but add the option to override them with a user-provided keyword.

On Fri, Oct 19, 2018 at 11:02 PM Joe Hamman notifications@github.com wrote:

It might be worthwhile to take a step back and look at what functionality other popular regridding tools offer in this area. The two I'm most familiar with are NCO and CDO, both are command line tools. My understanding is that both of these tools default to using coordinate information in the dataset (netCDF file). NCO seems to be the most flexible and works like this:

  • if the grid dataset has variables with coordinates attributes, these are used to define the grid
  • if no variables have the coordinates attribute, then some basic heuristics are used to determine where to find the coordinate information
  • finally, these can all be overridden with command line options (e.g. cremap -R "--rgr lat_nm=xq --rgr lon_nm=zj" -d dst.nc -O ~/rgr in.nc # Manual)

So xarray provides a lot of the necessary metadata to make the first two steps possible. One suggestion would be to first look at the grid variables and see if we can determine their coordinate variables, next, look for common names, finally, if we can't find a coordinate variable, raise an error. Of course, the signature of lon_name and lat_name should be optional, probably defaulting to None.


spencerahill commented 5 years ago

So xarray provides a lot of the necessary metadata to make the first two steps possible. One suggestion would be to first look at the grid variables and see if we can determine their coordinate variables, next, look for common names, finally, if we can't find a coordinate variable, raise an error. Of course, the signature of lon_name and lat_name should be optional, probably defaulting to None.

I like this.

But is this really more convenient than simply renaming the variable?

Yes, I think so. Otherwise in order to use xESMF a user has to first rename their coordinate(s), then use xESMF, and then if their pipeline requires the original name down the line to rename them back to the original.

bbakernoaa commented 5 years ago

Thanks @spencerahill for catching that.

Like I said, I suggest supporting a limited number of variable names for latitude and longitude. Stick with the big conventions like COARDS and CF, and add the option to override the automatic detection when a variable name is provided.

JiaweiZhuang commented 5 years ago

Otherwise in order to use xESMF a user has to first rename their coordinate(s), then use xESMF, and then if their pipeline requires the original name down the line to rename them back to the original.

I can see this point.

A simple way would be adding a rename_dict argument to allow overwriting the default names. For example, xe.Regridder(..., rename_dict = {'lon': 'x', 'lat': 'y', 'lon_b': 'x_b', 'lat_b': 'y_b'}). This basically integrates xarray.Dataset.rename into the regridder API...
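To make the rename round trip concrete, here is a rough sketch that assumes a grid is just a plain dict keyed by variable name. `rename_keys`, `call_with_default_names` and `fake_regrid` are hypothetical helpers for illustration only; nothing here is xESMF's actual behaviour. The dict direction follows the proposal above: default name first, user name second.

```python
from typing import Any, Callable, Dict, Mapping


def rename_keys(data: Mapping[str, Any], mapping: Mapping[str, str]) -> Dict[str, Any]:
    """Return a copy of `data` with keys renamed; unlisted keys are kept."""
    return {mapping.get(key, key): value for key, value in data.items()}


def call_with_default_names(
    func: Callable[[Dict[str, Any]], Dict[str, Any]],
    data: Mapping[str, Any],
    rename_dict: Mapping[str, str],
) -> Dict[str, Any]:
    """Rename user names to defaults, call `func`, then rename back."""
    # rename_dict maps default name -> user name, so invert it for the way in.
    to_default = {user: default for default, user in rename_dict.items()}
    if len(to_default) != len(rename_dict):
        raise ValueError("two default names map to the same user name")
    result = func(rename_keys(data, to_default))
    # On the way out, restore the names the user's pipeline expects.
    return rename_keys(result, rename_dict)


def fake_regrid(grid: Dict[str, Any]) -> Dict[str, Any]:
    # Stand-in for the regridder; it only understands 'lon' and 'lat'.
    return {"lon": grid["lon"], "lat": grid["lat"], "regridded": True}


user_grid = {"x": [0.0, 1.0], "y": [10.0, 20.0]}
out = call_with_default_names(fake_regrid, user_grid, {"lon": "x", "lat": "y"})
```

The user never sees 'lon' or 'lat' here, which is the convenience argued for earlier in the thread: no manual rename before the call and no rename back afterwards.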

Or a global configuration capability like dask.config.set:

xe.config.set(rename_dict = {'lon': 'x', 'lat': 'y', 'lon_b': 'x_b', 'lat_b': 'y_b'})

Or even as context manager:

rename_dict1 = {'lon': 'x', 'lat': 'y', 'lon_b': 'x_b', 'lat_b': 'y_b'}
rename_dict2 = {'lon': 'longitude', 'lat': 'latitude',
                'lon_b': 'longitude_b', 'lat_b': 'latitude_b'}

with xe.config.set(rename_dict = rename_dict1):
    xe.Regridder(...)  # automatically use `rename_dict1`

with xe.config.set(rename_dict = rename_dict2):
    xe.Regridder(...)  # automatically use `rename_dict2`
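For reference, a setter that works both as a plain call and as a context manager can be sketched in a few lines of standard-library Python. This is only an illustration of the pattern; `set_options`, `get_option` and the module-level `_config` dict are hypothetical names, and no `xe.config` module is assumed to exist.

```python
from typing import Any, Dict

_DEFAULT_NAMES = {"lon": "lon", "lat": "lat", "lon_b": "lon_b", "lat_b": "lat_b"}
_config: Dict[str, Any] = {"rename_dict": dict(_DEFAULT_NAMES)}


class set_options:
    """Apply options immediately; restore the old values when used with `with`."""

    def __init__(self, **options: Any) -> None:
        unknown = sorted(set(options) - set(_config))
        if unknown:
            raise KeyError(f"unknown option(s): {unknown}")
        # Remember what is being overwritten so __exit__ can put it back.
        self._previous = {key: _config[key] for key in options}
        _config.update(options)

    def __enter__(self) -> "set_options":
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        _config.update(self._previous)
        return False  # never swallow exceptions raised inside the block


def get_option(name: str) -> Any:
    return _config[name]


custom = {"lon": "x", "lat": "y", "lon_b": "x_b", "lat_b": "y_b"}
with set_options(rename_dict=custom):
    inside = get_option("rename_dict")   # the custom names apply here
outside = get_option("rename_dict")      # the defaults are back here
```

Called without `with`, `set_options(...)` simply leaves the new values in place, which mirrors how dask.config.set can be used either way.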

So people can set any coordinate names they are accustomed to.

Would this be more intuitive & convenient from a user's perspective? This also avoids complicating the major API, considering that the majority of users should be OK with the default settings.

JiaweiZhuang commented 5 years ago

This might be over-engineering, but if a user really wants to mix multiple names, a list of candidate names can also be possible:

xe.config.set(rename_dict = {'lon': ['lon', 'longitude', 'x'],
                             'lat': ['lat', 'latitude', 'y']})
#  the look-up priority is the order of names in the list: 'lon', then 'longitude', then 'x'
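The priority rule described in that comment could be resolved along these lines. This is a sketch only: `resolve_names` is a hypothetical helper, and the set of available variable names stands in for whatever the grid object actually contains.

```python
from typing import Dict, Iterable, Mapping, Sequence, Union

Candidates = Union[str, Sequence[str]]


def resolve_names(
    rename_dict: Mapping[str, Candidates], available: Iterable[str]
) -> Dict[str, str]:
    """For each default name, pick the first candidate present in `available`."""
    present = set(available)
    resolved: Dict[str, str] = {}
    for default_name, candidates in rename_dict.items():
        # Accept a single string as shorthand for a one-element list.
        if isinstance(candidates, str):
            candidates = [candidates]
        for name in candidates:  # list order is the look-up priority
            if name in present:
                resolved[default_name] = name
                break
        else:
            raise KeyError(f"none of {list(candidates)} found for {default_name!r}")
    return resolved


rules = {"lon": ["lon", "longitude", "x"], "lat": ["lat", "latitude", "y"]}
names = resolve_names(rules, available=["x", "longitude", "y", "temp"])
```

Because the rules are an explicit, ordered list, a dataset containing both 'longitude' and 'x' resolves predictably to 'longitude', with no hidden heuristics to debug.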

I would prefer to let users explicitly code up the rules they want, rather than to have some implicit heuristics for them. The latter might lead to tricky conditions that are hard to explain & debug.

spencerahill commented 5 years ago

@JiaweiZhuang those are all cool ideas. But I do wonder if all but the rename_dict argument are too much for now. After all, this may not even turn out to be a big use case.

But ultimately whatever you decide is fine. My final 2 cents on this issue is just that whatever is implemented needs tests...there could be a fair number of tricky corner cases.

bbakernoaa commented 5 years ago

@JiaweiZhuang I like the idea of being able to pass a configuration dictionary. Do you mean being able to pass multiple candidate names per key for it to search through? If so, I like the idea. This way multiple names could be loaded at once. It could be very advantageous if you open multiple datasets with different definitions of variables.

stefraynaud commented 5 years ago

How about searching for lon and lat variables also by checking the standard_name and units attributes? For instance, for longitudes, the standard_name must start with longitude or longitude_at_. Here are the conventions: http://cfconventions.org/Data/cf-conventions/cf-conventions-1.6/build/cf-conventions.html#latitude-coordinate http://cfconventions.org/Data/cf-conventions/cf-conventions-1.6/build/cf-conventions.html#longitude-coordinate
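An attribute-based check might look roughly like the following. The unit strings are the ones the CF conventions list as acceptable for latitude and longitude; everything else is an assumption for illustration: variables are modelled as plain attribute dicts, `classify_coord` and `find_latlon` are hypothetical names, and the standard_name test is an exact match on 'longitude' / 'latitude' (a prefix rule like the one mentioned above could be swapped in).

```python
from typing import Dict, List, Mapping, Optional

# Unit spellings accepted by the CF conventions for horizontal coordinates.
LON_UNITS = frozenset(
    {"degrees_east", "degree_east", "degree_E", "degrees_E", "degreeE", "degreesE"}
)
LAT_UNITS = frozenset(
    {"degrees_north", "degree_north", "degree_N", "degrees_N", "degreeN", "degreesN"}
)


def classify_coord(attrs: Mapping[str, str]) -> Optional[str]:
    """Return 'lon', 'lat', or None based on standard_name and units."""
    standard_name = attrs.get("standard_name")
    units = attrs.get("units")
    if standard_name == "longitude" or units in LON_UNITS:
        return "lon"
    if standard_name == "latitude" or units in LAT_UNITS:
        return "lat"
    return None


def find_latlon(variables: Mapping[str, Mapping[str, str]]) -> Dict[str, List[str]]:
    """Group variable names by detected kind; a kind may have several matches."""
    found: Dict[str, List[str]] = {"lon": [], "lat": []}
    for name, attrs in variables.items():
        kind = classify_coord(attrs)
        if kind is not None:
            found[kind].append(name)
    return found


dataset = {
    "LONGITUDES": {"units": "degrees_east"},
    "LATITUDES": {"standard_name": "latitude", "units": "degrees_north"},
    "temp": {"units": "K"},
}
detected = find_latlon(dataset)
```

Returning lists rather than single names keeps the ambiguous case visible: if a grid object contains several valid candidates, the caller can raise an error or ask the user to choose.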

ahuang11 commented 4 years ago

metpy has a parse_cf() function that labels coordinates' attrs with _metpy_axis and the corresponding axis (https://unidata.github.io/MetPy/latest/_modules/metpy/xarray.html#MetPyDatasetAccessor.parse_cf), but that would mean having to make metpy a dependency (or checking whether metpy is installed and, if so, allowing users to have this additional functionality).
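The "check if it is installed" option can be done without importing the package, using the standard library. A small sketch, where `has_module` and `coordinate_parser_backend` are hypothetical names and the actual metpy call is deliberately left out:

```python
import importlib.util


def has_module(name: str) -> bool:
    """True if a top-level module called `name` can be found, without importing it."""
    return importlib.util.find_spec(name) is not None


def coordinate_parser_backend() -> str:
    """Choose which coordinate detection to use, depending on what is installed."""
    # If metpy is available, its CF parsing could be used as an optional extra;
    # otherwise fall back to a built-in name-based lookup.
    return "metpy" if has_module("metpy") else "builtin"


backend = coordinate_parser_backend()
```

This keeps metpy out of the hard requirements while still letting users who have it benefit from its CF parsing.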

pochedls commented 4 years ago

@bbakernoaa and @JiaweiZhuang - I had a workaround that I now realize is very similar to this pull request (it is here: https://github.com/pochedls/xESMF/commit/7ec903dd9cf8dce24b492a20117a9ba607b791ef), though it doesn't deal with the bounds (and my get_axis_ids function is a little different). Is there any reason the pull request in this conversation can't be merged (after conflicts are resolved)? Let me know if I can help.

JiaweiZhuang commented 4 years ago

@pochedls I summarized my concerns and proposals at #74. If you'd like to take a look at that it would be great!