NCPP / ocgis

OpenClimateGIS is a set of geoprocessing and calculation tools for CF-compliant climate datasets.
Other
70 stars 19 forks source link

Interoperability with xarray/dask #479

Open bekozi opened 6 years ago

bekozi commented 6 years ago

Add to_xarray(...) on ocgis fields and potentially a to_ocgis(...) somewhere on the other end.

Ping @jhamman

jhamman commented 6 years ago

Thanks for opening this @bekozi.

Is a field a single variable (e.g. temperature)? Or is it a collection of variables?

For a single variable, we'd be looking to map the ocgis structure to an xarray.DataArray. The constructor for that is:

xarray.DataArray(data, coords=None, dims=None, name=None, attrs=None)

The xarray.Dataset is just a dict-like container of DataArrays. So getting the mapping for a DataArray done would be the first step.

bekozi commented 6 years ago

Is a field a single variable (e.g. temperature)? Or is it a collection of variables?

It's a bit of both. It was easier to have fields also be collections so that is what they are - essentially collections associated with coordinate variables via a dimension map. They are also dict-like containers similar to the xarray.Dataset.

For a single variable, we'd be looking to map the ocgis structure to an xarray.DataArray. The constructor for that is:

Cool, thanks. It seems pretty straightforward, but I do have some questions (turned into more than I expected...):

  1. ocgis has a GeometryVariable storing an n-dimensional object array (it can have a mask too) of shapely geometries. Does xarray work with geopandas? What is the most appropriate way to move data back and forth here?
  2. When working with a decomposition, ocgis dimensions store local and global bounds. Can this information be maintained (elegantly) so converted objects running in parallel don't get lost? This is kind of a detail, but worth keeping in mind... Actually, most of these questions are details.
  3. Does xarray do anything special with coordinate systems (e.g. WGS84)?
  4. Are coordinate bounds variables just passed to coords?
  5. Is the best method for passing masked data to values a masked array? Are fill values maintained?
  6. Does xarray support ragged arrays (i.e. netcdf variable length types)? This is important for UGRID data.
  7. It looks like groups are supported in xarray so we can move those over as well. Anything to be aware of there?
  8. ocgis maintains singleton dimensions. Will these be squeezed by default? Undimensioned variables okay in xarray?

Sorry for the barrage here. Take your time. :grin:

jhamman commented 6 years ago

ocgis has a GeometryVariable storing an n-dimensional object array (it can have a mask too) of shapely geometries. Does xarray work with geopandas? What is the most appropriate way to move data back and forth here?

There hasn't been much (any?) direct integration with geopandas. We can store arrays of objects in xarray but there won't be the geo-hooks that geopandas has. I think that is okay for a first cut.

When working with a decomposition, ocgis dimensions store local and global bounds. Can this information be maintained (elegantly) so converted objects running in parallel don't get lost? This is kind of a detail, but worth keeping in mind... Actually, most of these questions are details. Does xarray do anything special with coordinate systems (e.g. WGS84)?

If these are scalars (or tuples), they can either be stored as coordinates or as attributes. If they are attributes, they won't be used by xarray's operations but they will stay with the data.

Are coordinate bounds variables just passed to coords?

Yes, usually. If they have different dimensions/shapes, they need a bit of special treatment but for now, let's treat them as coords.

Is the best method for passing masked data to values a masked array? Are fill values maintained?

If you pass in a masked array, xarray will convert it to a standard numpy array and use nan's as fill values. This is mostly for performance. I'm not 100% sure but I don't think we store the fill value. We could probably come up with something though.

Does xarray support ragged arrays (i.e. netcdf variable length types)? This is important for UGRID data.

No. There are conversations in the works for supporting sparse arrays but this is not a feature yet.

It looks like groups are supported in xarray so we can move those over as well. Anything to be aware of there?

Do you mean groups in the hierarchical sense (like HDF)? This is not supported by xarray.

ocgis maintains singleton dimensions. Will these be squeezed by default?

I think this should work (as long as the array shape has the same number of dimensions as the specified dimensions).

Undimensioned variables okay in xarray?

Yes. Xarray will add dummy dimension names (e.g. dim_0) but this is supported.

bekozi commented 6 years ago

Thanks for the quick response @jhamm. All sounds pretty good!

There hasn't been much (any?) direct integration with geopandas. We can store arrays of objects in xarray but there won't be the geo-hooks that geopandas has. I think that is okay for a first cut.

Got it. I too think that'll work okay at first.

I'm not 100% sure but I don't think we store the fill value. We could probably come up with something though.

Not that big of deal, more curious how masking is treated. It only matters in very specific cases.

No. There are conversations in the works for supporting sparse arrays but this is not a feature yet.

I think the object arrays will suffice now that I think about it.

Do you mean groups in the hierarchical sense (like HDF)? This is not supported by xarray.

I do. And oops, I thought I saw this as a feature. This will be a problem with spatial collections where a subset geometry stores it's associated subsetted field. It should be possible to get around this by "melting" but will take a bit of leg work. Not as big a deal with unioning and spatial averaging.

bekozi commented 6 years ago

@jhamman I've made some progress on this: https://github.com/NCPP/ocgis/tree/i479-xarray. I need to add coords which may take a little bit. I did encounter one thing. Is it a known feature that data types in xarray or pandas are upcast to float if an integer array has a mask? See this code as an example. The test currently fails. What do you think?

jhamman commented 6 years ago

Cool! Glad to see this is moving forward.

Is it a known feature that data types in xarray or pandas are upcast to float if an integer array has a mask?

Yes. I think this issue is the closest open one to what you're running into: https://github.com/pydata/xarray/issues/1194

I suspect we'll get there eventually but not particularly soon.

bekozi commented 6 years ago

Yes. I think this issue is the closest open one to what you're running into: pydata/xarray#1194

Thanks! Next time I'll do the legwork on the issue search...

bekozi commented 6 years ago

@jhamman I hit an issue in decode_cf. I worked a bit on constructing a Dataset using the ocgis interpretation of metadata, but I realized that using xarray's decoding facility would probably be better for xarray internals.

Anyway, I think the error is related to time bounds (it doesn't happen when I strip the time bounds variable from the dataset prior to decoding).

Error traceback:

======================================================================
ERROR: test_to_xarray (ocgis.test.test_ocgis.test_collection.test_field.TestField)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/benkoziol/Dropbox/NESII/project/ocg/git/ocgis/src/ocgis/test/test_ocgis/test_collection/test_field.py", line 455, in test_to_xarray
    xr = field.to_xarray()
  File "/home/benkoziol/Dropbox/NESII/project/ocg/git/ocgis/src/ocgis/collection/field.py", line 982, in to_xarray
    ret = decode_cf(ret)
  File "/home/benkoziol/anaconda2/envs/ocgis/lib/python2.7/site-packages/xarray/conventions.py", line 603, in decode_cf
    decode_coords, drop_variables=drop_variables)
  File "/home/benkoziol/anaconda2/envs/ocgis/lib/python2.7/site-packages/xarray/conventions.py", line 536, in decode_cf_variables
    stack_char_dim=stack_char_dim)
  File "/home/benkoziol/anaconda2/envs/ocgis/lib/python2.7/site-packages/xarray/conventions.py", line 471, in decode_cf_variable
    var = coder.decode(var, name=name)
  File "/home/benkoziol/anaconda2/envs/ocgis/lib/python2.7/site-packages/xarray/coding/times.py", line 348, in decode
    if 'units' in attrs and 'since' in attrs['units']:
TypeError: argument of type 'NoneType' is not iterable

This is the dataset to decode:

<xarray.Dataset>
Dimensions:      (bounds: 2, time: 3, x: 360, y: 180)
Coordinates:
  * x            (x) float64 -179.5 -178.5 -177.5 -176.5 -175.5 -174.5 ...
  * y            (y) float64 -89.5 -88.5 -87.5 -86.5 -85.5 -84.5 -83.5 -82.5 ...
  * time         (time) float32 1.0 2.0 3.0
Dimensions without coordinates: bounds
Data variables:
    xbounds      (x, bounds) float64 -180.0 -179.0 -179.0 -178.0 -178.0 ...
    ybounds      (y, bounds) float64 -90.0 -89.0 -89.0 -88.0 -88.0 -87.0 ...
    foo          (time, y, x) float32 12.999924 12.998706 12.996271 ...
    time_bounds  (time, bounds) float32 0.5 1.5 1.5 2.5 2.5 3.5

This time variable decodes okay:

<xarray.DataArray 'time' (time: 3)>
array([1., 2., 3.], dtype=float32)
Coordinates:
  * time     (time) float32 1.0 2.0 3.0
Attributes:
    axis:      T
    calendar:  standard
    units:     days since 0001-01-01 00:00:00
    bounds:    time_bounds

Here is the bounds variable causing an issue:

<xarray.DataArray 'time_bounds' (time: 3, bounds: 2)>
array([[0.5, 1.5],
       [1.5, 2.5],
       [2.5, 3.5]], dtype=float32)
Coordinates:
  * time     (time) float32 1.0 2.0 3.0
Dimensions without coordinates: bounds
Attributes:
    calendar:  standard
    units:     days since 0001-01-01 00:00:00
jhamman commented 6 years ago

I'm not able to reproduce this working off xarray@master. What happens if you try to decode the time_bounds variable by itself?

xr.decode_cf(ds['time_bounds'])

or if you try to set the time_bounds variable as a coordinate:

ds = ds.set_coords('time_bounds')
xr.decode_cf(ds)

For some reason, the attrs variable in decode() is None.

bekozi commented 6 years ago

..sorry for the slow response...caught a human bug...

Thanks for trying to reproduce the issue. Looking at this again, this is something ocgis is introducing in the spatial coordinate bounds variable attributes during extrapolation. Here is the fix on this end. After changing this, the decoding works. Basically, it was setting the units on the bounds even if the units are none. So, now the conversion seems to be working!

I know we talked briefly about bounds variables and coordinates in xarray. The bounds are just extra variables in the dataset:

<xarray.Dataset>
Dimensions:             (bounds: 2, dim_0: 0, time: 3, x: 360, y: 180)
Coordinates:
  * x                   (x) float64 -179.5 -178.5 -177.5 -176.5 -175.5 ...
  * y                   (y) float64 -89.5 -88.5 -87.5 -86.5 -85.5 -84.5 ...
  * time                (time) object    1-01-02 00:00:00 ...
Dimensions without coordinates: bounds, dim_0
Data variables:
    xbounds             (x, bounds) float64 ...
    ybounds             (y, bounds) float64 ...
    foo                 (time, y, x) float32 ...
    ocgis_point         (y, x) object ...
    latitude_longitude  (dim_0) float64 ...
    time_bounds         (time, bounds) object ...

The coordinate system variable is also just transferred over with no special attributes. It would be possible to insert a PROJ or WKT description into its attributes. The ocgis_point variable is a 2D array of shapely point geometries.

I am thinking about cleaning this up a bit and pushing. Can you think of anything else that should be added in this first cut?

jhamman commented 6 years ago

I think this is a good start. I'm quite busy over the next few weeks and don't know if I'll have any real time to put it into action. I'm fine with your plan of providing a first cut as an experimental feature.

bekozi commented 6 years ago

Initial capability is in master. I'm going to close this and as issues/features come up we can create new tix.

bekozi commented 5 years ago

Reopening to capture internal changes related to xarray compatibility for xESMF spatial chunking and concurrent weight generation (https://github.com/JiaweiZhuang/xESMF/pull/26, https://github.com/JiaweiZhuang/xESMF/issues/29).

Major remaining issues are related to coordinate systems and spatial masking.

Branch: https://github.com/NCPP/ocgis/tree/i479-xarray-interop

bekozi commented 5 years ago

@huard, a good example to work from for xclim is this masking test for the xarray driver:

You can also use operations to subset the xarray field.

I expect a number of operations will fail as this is not exhaustively tested. There are numerous specialization points in the code that should be on the driver eventually. It's just a matter of collecting all the edge cases...

Time bounds calculations are performed when a grouping is computed to create a new time axis. The new_bounds variable contains the lower and upper bounds for each new climatology centroid as a (2, nt) numpy array.

Let me know if you have any questions!