Closed bbakernoaa closed 2 years ago
I like this. I'm wondering if it would be within reason to generalize this even further, to allow the user to specify their own lat and lon coordinate names, e.g. if their data uses 'x' and 'y' or 'LONGITUDES' and 'LATITUDES' or any other variants.
Thanks for the PR, that's a good point...
We need to think carefully about this because @spencerahill 's comment can be further generalized to
'y' seems a bad idea because the coordinate value is latitude, but other options all seem reasonable. But too many aliases will cause confusion. Also, what if a grid object contains multiple valid variable names?
Even worse for boundary variables:
You name it. I am happy to discuss how to handle this potential chaos...
One way is to have a signature like
Regridder(grid_in, grid_out, ..., lon_name = 'lon', lat_name = 'lat')
where lon_name can be overwritten by users. But is this really more convenient than simply renaming the variable?
It might be worthwhile to take a step back and look at what functionality other popular regridding tools offer in this area. The two I'm most familiar with are NCO and CDO, both are command line tools. My understanding is that both of these tools default to using coordinate information in the dataset (netCDF file). NCO seems to be the most flexible and works like this:
- if the grid dataset has variables with coordinates attributes, these are used to define the grid
- if no variables have the coordinates attribute, then some basic heuristics are used to determine where to find the coordinate information
- finally, these can all be overridden with command line options (e.g. ncremap -R "--rgr lat_nm=xq --rgr lon_nm=zj" -d dst.nc -O ~/rgr in.nc # Manual)
So xarray provides a lot of the necessary metadata to make the first two steps possible. One suggestion would be to first look at the grid variables and see if we can determine their coordinate variables, next, look for common names, finally, if we can't find a coordinate variable, raise an error. Of course, the signature of lon_name and lat_name should be optional, probably defaulting to None.
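The lookup order suggested above (explicit override, then `coordinates` attributes, then common names, then an error) could be sketched roughly as follows. This is only an illustration: `find_lon_name`, `COMMON_LON_NAMES` and the plain-dict stand-in for a dataset are hypothetical names, not part of xESMF or xarray, and using the `units` attribute to tell longitude from latitude in step 2 is an assumption of this sketch.

```python
# Hypothetical helper; `variables` is a plain-dict stand-in for a dataset,
# mapping each variable name to its attribute dict.
LON_UNITS = {"degrees_east", "degree_east", "degrees_E", "degree_E"}
COMMON_LON_NAMES = ("lon", "longitude")


def find_lon_name(variables, lon_name=None):
    """Return the name of the longitude variable."""
    # 1. An explicit user override always wins.
    if lon_name is not None:
        if lon_name not in variables:
            raise KeyError(f"{lon_name!r} not found in dataset")
        return lon_name

    # 2. Names listed in any variable's `coordinates` attribute,
    #    recognised as longitude by their units (assumption of this sketch).
    for attrs in variables.values():
        for name in attrs.get("coordinates", "").split():
            if variables.get(name, {}).get("units") in LON_UNITS:
                return name

    # 3. Fall back to a short list of common names.
    for name in COMMON_LON_NAMES:
        if name in variables:
            return name

    # 4. Nothing matched: fail loudly instead of guessing.
    raise ValueError("could not determine the longitude coordinate; pass lon_name")
```

A latitude version would look the same with the corresponding units and names.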
I suggest that we support a limited number of them by default but add the option to override it with a user-provided keyword.
On Fri, Oct 19, 2018 at 11:02 PM Joe Hamman notifications@github.com wrote:
It might be worthwhile to take a step back and look at what functionality other popular regridding tools offer in this area. The two I'm most familiar with are NCO and CDO, both are command line tools. My understanding is that both of these tools default to using coordinate information in the dataset (netCDF file). NCO seems to be the most flexible and works like this:
- if the grid dataset has variables with coordinates attributes, these are used to define the grid
- if no variables have the coordinates attribute, then some basic heuristics are used to determine where to find the coordinate information
- finally, these can all be overridden with command line options (e.g. ncremap -R "--rgr lat_nm=xq --rgr lon_nm=zj" -d dst.nc -O ~/rgr in.nc # Manual)
So xarray provides a lot of the necessary metadata to make the first two steps possible. One suggestion would be to first look at the grid variables and see if we can determine their coordinate variables, next, look for common names, finally, if we can't find a coordinate variable, raise an error. Of course, the signature of lon_name and lat_name should be optional, probably defaulting to None.
So xarray provides a lot of the necessary metadata to make the first two steps possible. One suggestion would be to first look at the grid variables and see if we can determine their coordinate variables, next, look for common names, finally, if we can't find a coordinate variable, raise an error. Of course, the signature of lon_name and lat_name should be optional, probably defaulting to None.
I like this.
But is this really more convenient than simply renaming the variable?
Yes, I think so. Otherwise in order to use xESMF a user has to first rename their coordinate(s), then use xESMF, and then if their pipeline requires the original name down the line to rename them back to the original.
Thanks @spencerahill for catching that.
Like I said I suggest that you should support a limited number of named variables for latitude and longitude. Stick with the big conventions like COARDS and CF and add the option to override the automatic detection if the variable is provided.
Otherwise in order to use xESMF a user has to first rename their coordinate(s), then use xESMF, and then if their pipeline requires the original name down the line to rename them back to the original.
I can see this point.
A simple way would be adding a rename_dict argument to allow overwriting the default names. For example:
xe.Regridder(..., rename_dict = {'lon': 'x', 'lat': 'y', 'lon_b': 'x_b', 'lat_b': 'y_b'})
This basically integrates xarray.Dataset.rename into the regridder API...
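To make the rename round trip concrete, here is a minimal sketch of how a `rename_dict` of the form above could be turned into the two mappings needed before and after regridding. `split_renames` is a hypothetical helper, not an xESMF function, and it works on plain names so that it does not depend on xarray.

```python
def split_renames(rename_dict, present_names):
    """Build the two mappings needed for a rename round trip.

    rename_dict maps default names to user names, e.g. {'lon': 'x'}.
    present_names is the set of names that actually exist in the dataset.
    """
    # Only rename what the dataset actually contains.
    to_user = {default: user for default, user in rename_dict.items()
               if user in present_names}
    # Invert the mapping to go from user names to the default names.
    to_default = {user: default for default, user in to_user.items()}
    if len(to_default) != len(to_user):
        raise ValueError("two default names map to the same user name")
    return to_default, to_user
```

With xarray, `ds.rename(to_default)` before regridding and `result.rename(to_user)` afterwards would complete the round trip.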
Or a global configuration capability like dask.config.set:
xe.config.set(rename_dict = {'lon': 'x', 'lat': 'y', 'lon_b': 'x_b', 'lat_b': 'y_b'})
Or even as a context manager:
rename_dict1 = {'lon': 'x', 'lat': 'y', 'lon_b': 'x_b', 'lat_b': 'y_b'}
rename_dict2 = {'lon': 'longitude', 'lat': 'latitude',
                'lon_b': 'longitude_b', 'lat_b': 'latitude_b'}
with xe.config.set(rename_dict = rename_dict1):
    xe.Regridder(...)  # automatically use `rename_dict1`
with xe.config.set(rename_dict = rename_dict2):
    xe.Regridder(...)  # automatically use `rename_dict2`
So people can set any coordinate names they are accustomed to.
Would this be more intuitive & convenient from a user's perspective? This also avoids complicating the major API, considering that the majority of users should be OK with the default settings.
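For reference, a setting that works both as a global switch and as a context manager, in the spirit of dask.config.set, can be written in a few lines. The sketch below is only an assumption about how such a feature might look: `set_config`, `get_config` and `_CONFIG` are hypothetical names, and xESMF does not currently provide an `xe.config` module.

```python
# Hypothetical module-level configuration store.
_CONFIG = {"rename_dict": None}


class set_config:
    """Apply options immediately; restore the old values when used with `with`."""

    def __init__(self, **options):
        unknown = options.keys() - _CONFIG.keys()
        if unknown:
            raise KeyError(f"unknown option(s): {sorted(unknown)}")
        # Remember the previous values so __exit__ can put them back.
        self._old = {key: _CONFIG[key] for key in options}
        _CONFIG.update(options)

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        _CONFIG.update(self._old)


def get_config(key):
    """Read the current value of an option."""
    return _CONFIG[key]
```

Called on its own, `set_config(...)` changes the setting for the rest of the session; inside a `with` block the previous value comes back on exit.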
This might be over-engineering, but if a user really wants to mix multiple names, a list of candidate names could also be allowed:
xe.config.set(rename_dict = {'lon': ['lon', 'longitude', 'x'],
                             'lat': ['lat', 'latitude', 'y']})
# the look-up priority is the order of names in the list: 'lon', then 'longitude', then 'x'
I would prefer to let users explicitly code up the rules they want, rather than to have some implicit heuristics for them. The latter might lead to tricky conditions that are hard to explain & debug.
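The priority look-up over a list of candidate names could be as simple as the following sketch. `resolve_name` is a hypothetical helper used only for illustration; it accepts either a single name or an ordered list of candidates.

```python
def resolve_name(candidates, available):
    """Return the first candidate that is present in `available`."""
    # Allow a single name as shorthand for a one-element list.
    if isinstance(candidates, str):
        candidates = [candidates]
    for name in candidates:
        if name in available:
            return name
    raise KeyError(f"none of {list(candidates)} found in {sorted(available)}")
```

Because the first match wins, a dataset containing both 'longitude' and 'x' resolves to 'longitude' for the list `['lon', 'longitude', 'x']`, which keeps the rule explicit and easy to explain.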
@JiaweiZhuang those are all cool ideas. But I do wonder if all but the rename_dict argument are too much for now. After all, this may not even turn out to be a big use case.
But ultimately whatever you decide is fine. My final 2 cents on this issue is just that whatever is implemented needs tests...there could be a fair number of tricky corner cases.
@JiaweiZhuang I like the idea of being able to pass a configuration dictionary. Do you mean that it could hold multiple candidate names per key, which would be searched in turn? If so, I like the idea. This way multiple keywords could be loaded at once, which could be very advantageous if you open multiple datasets with different definitions of variables.
How about also searching for lon and lat variables by checking the standard_name and units attributes?
For instance, for longitudes, the standard_name must start with longitude or longitude_at_.
Here are the conventions:
http://cfconventions.org/Data/cf-conventions/cf-conventions-1.6/build/cf-conventions.html#latitude-coordinate
http://cfconventions.org/Data/cf-conventions/cf-conventions-1.6/build/cf-conventions.html#longitude-coordinate
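An attribute-based check along these lines might look like the sketch below. The unit strings are the ones listed in the CF sections linked above; `is_longitude` and `is_latitude` are hypothetical helper names, and this sketch matches the standard_name exactly, which is stricter than the prefix match suggested above.

```python
# Unit spellings accepted by the CF conventions for each coordinate.
LON_UNITS = {"degrees_east", "degree_east", "degree_E",
             "degrees_E", "degreeE", "degreesE"}
LAT_UNITS = {"degrees_north", "degree_north", "degree_N",
             "degrees_N", "degreeN", "degreesN"}


def is_longitude(attrs):
    """True if a variable's attribute dict marks it as a longitude coordinate."""
    return attrs.get("standard_name") == "longitude" or attrs.get("units") in LON_UNITS


def is_latitude(attrs):
    """True if a variable's attribute dict marks it as a latitude coordinate."""
    return attrs.get("standard_name") == "latitude" or attrs.get("units") in LAT_UNITS
```

With xarray, the `attrs` argument would typically be the `.attrs` dictionary of each variable in the dataset.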
metpy has a parse_cf() function that labels coordinates' attrs with _metpy_axis and the corresponding axis https://unidata.github.io/MetPy/latest/_modules/metpy/xarray.html#MetPyDatasetAccessor.parse_cf, but that would mean having to make metpy a dependency (or checking if metpy is installed and, if so, allowing users to have this additional functionality)
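The "check if metpy is installed" option can be done without importing the package up front. A minimal sketch, where `metpy_available` is a hypothetical helper name:

```python
import importlib.util


def metpy_available():
    """Return True if the optional metpy package can be imported."""
    # find_spec returns None for a top-level package that is not installed.
    return importlib.util.find_spec("metpy") is not None
```

The CF-parsing path would then only be taken when this returns True, with the name-based look-up as the fallback.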
@bbakernoaa and @JiaweiZhuang - I had a workaround that I now realize is very similar to this pull request (it is here: https://github.com/pochedls/xESMF/commit/7ec903dd9cf8dce24b492a20117a9ba607b791ef), though it doesn't deal with the bounds (and my get_axis_ids function is a little different). Is there any reason the pull request in this conversation can't be merged (after conflicts are resolved)? Let me know if I can help.
@pochedls I summarized my concerns and proposals at #74. If you'd like to take a look at that it would be great!
Adding 'latitude' and 'longitude' to acceptable coordinate names to be compliant with netCDF CF 1.6 conventions. It adds a new method called get_latlon_names that checks whether 'lat' or 'latitude' (or a name following the netCDF COARDS or netCDF CF convention) is in the xr.DataArray.