NOAA-ORR-ERD / gridded

A single API for accessing / working with gridded model results on multiple grid types
https://noaa-orr-erd.github.io/gridded/index.html
The Unlicense
64 stars 14 forks source link

Celltrees, regridding, xarray #55

Open Huite opened 4 years ago

Huite commented 4 years ago

After some back and forth with @ChrisBarker-NOAA in the ugrid repo, I thought I'd let you know what I've been up to, and I've got some questions / remarks.

First of all, the celltree looks like a pretty nifty data structure. I wanted to use it for area weighted regridding, in which I need to know overlap between source and destination cells. Best way to solve it seems like searching the tree with bounding boxes. Rather than implement this in Cython (which I wasn't that familiar with), I generally prefer Numba: https://numba.pydata.org/ Since it's pretty seamless with Python and doesn't require distributing binary dependencies. I ended up porting the C++ CellTree to Python and added a method to search bounding boxes): https://github.com/Huite/xugrid/blob/master/xugrid/geometry/cell_tree.py (I only found out at 70% or so that I could've just used re-used the tree built by the C++ module ... the best way to understand something is to implement it yourself it, I guess...)

With some coaxing of the compiler it gets respectable serial performance, and since Numba has a prange just like Cython, it's very easy to parallellize.

I've also been looking at clipping (since I have to figure out intersection areas), it turns out the GIS stuff is indeed way too slow: (All scripts and source here: https://github.com/Huite/clipping-benchmarks)

Anyway, with a celltree and a clipping function, it's not so hard to get an area weighted regridder together. I've sketched an implementation here: https://github.com/Huite/xugrid/blob/master/xugrid/regrid.py

In my neck of the woods (geohydrology), I can get pretty far with some area weighted means (arithmetic, harmonic, geometric) and linear interpolation. But what techniques do you guys typically use for regridding? I'm curious to hear whether this is at all useful to you.

Finally, I opened an issue in the xarray repo, about maybe setting up an xarray extension for some unstructured mesh utils, using the ugrid data model: https://github.com/pydata/xarray/issues/4222 The conversation with @rabernat in #20 in this repo is clarifying, but I don't think I've quite wrapped my head around things -- I'll also admit I'm rather wedded to xarray. Groundwater flow is also rather straightforward (Darcy's law is linear, relatively local (cartesian) scale), so I've never had to deal with staggered or Arakawa grids. But maybe you have some comments on how my ideas fit in (if they do at all).

ChrisBarker-NOAA commented 4 years ago

I haven't had time to think through all this, but a few quick thoughts:

1) I haven't done any regridding, but I know the folks they do are focused on conserving properties (mass, momentum), so need to know a bit about what the model grids are really telling you. I"d look at what the EMSF project is doing -- they've put a lot of work in on that.

2) As for xarray: We've been keeping an eye on xarray for along time -- back in it's early days, it wasn't mature enough to build our stuff on, and Stephan was pretty clear that he didn't want to complicate it with all this grid stuff. But it has matured a lot since then, so:

Last year at SciPy, we talked about it a bit -- I think there may be some room for an abstraction of "coordinates" in xarray that could then be used to "plug in" grid-aware code, like gridded. But I stil dont totally get xarray enough to know how that could be done.

But then, there's the other direction: gridded is based around numpy-like arrays. In practive the only ones it supports are numpy arrays and netcdf4 variables. but an xarray array should be able to be plugged in as well.

And we could go one further -- and refactor gridded to use xarray arrays internally. then it supplies the abstraction around netcdf and others, and give you all teh nifty xarray stuff, like access to dask, and zarr, and ???

So I think that's well worth pursuing. I don't think it would be very hard, a xarray dataset is kinda like a netcdf4 Dataset. There are some decisions to be made about how to mesh the abstractions, but I think they are pretty solvable.

So I think that's the way to get your xarray and gridded, too.

I was hoping to sprint on this at SciPy this year, but it just didn't seem worth it with the remote issues. But maybe a gridded sprint at some point is in order?

BTW, I think your code for " a celltree and a clipping function" could make sense in gridded, if you'd consider a PR.

jay-hennen commented 4 years ago

Hey! Glad you found the cell_tree2d code useful enough to port it all into numba! Here's some quick notes about that, since you've delved deeper than most into the source code.

There is one other celltree implementation in the wild that I am aware of: vtkCellTreeLocator. I originally wrote cell_tree2d as a unique implementation of the source paper with some thematic guidance from the VTK code. We wanted to use celltree without all the VTK baggage, and also specialize it for our use case, (2D only, one or two polygons per leaf).

The VTK code was I believe directly written by the paper author, so I point to it as the 'authoritative' example, in case you're interested. I'm sure it's also a superior implementation, assuming you configure it optimally for your use case. However, I also remember finding the source code itself so obtuse and intensely comment free that at the time that I made better headway reading the paper and writing an original implementation than trying to adapt or prune the existing one.

I tried to write Cell_tree2d so that making it 3D one day would not be an enormous lift, but the jury is still out on whether I succeeded. It also seems like you are close to another feature that the paper alluded to and one which I didn't implement: passing a bounding box in and getting all the intersecting polygons out. This in particular would be a very useful tool for anyone wanting to subset an unstructured grid.

Sorry I can't say much more on your regridder than what Chris has already alluded to. The nuances in conserving properties and the like aren't my area of expertise.

Huite commented 4 years ago

@ChrisBarker-NOAA I think you're right, Gridded has the appropriate concepts in terms of Dataset, Grids, Variables. But at first glance, it does seem like a huge refactor to me: there's almost not a single module that would remain untouched? There seems to be quite some shared functionality (e.g. time handling, of course IO).

Additionally, I've had relatively good experiences with conda in the past few years, so I've adapted a somewhat maximalist position with regards to dependencies (just let conda figure it out). Fully featured xarray requires quite a number of dependencies, e.g. xarray delayed evaluation requires dask (which in turn requires bottleneck for a number of operations, and so forth).

So I'd expect breaking changes, and you'd up with quite a different package with a much larger footprint.

Just pitching an idea: I'm relatively familiar with the xarray API -- maybe I could try to mimick some parts of the gridded API, but taking xarray objects rather than numpy arrays as the fundamental object? I'll try to learn as much as possible from the gridded API (taking the bits I need first), and prototype some stuff taking xarray as the departure point (rather than numpy and netcdf4). Based on that, you could decide whether that provides a starting point for a gridded sprint to seriously refactor. Even if we end up going our own separate ways, there's still a bunch of useful PRs I could contribute (like the mentioned clipping & regridding).

On the short term, that also gives me a bit more freedom (because I don't to have to ask you to spend resources checking my work), which could very well be necessary to keep my managers happy for the goals set for this year. On the long term, of course I'd greatly prefer joining forces.

@jay-hennen I did end up finding the vtk version as well! Yours was indeed very pleasant to read ^^ I don't expect the third dimension to make a huge difference, I'll probably have a try in the future. I did fall back on a stack based approach rather than using recursion, because numba doesn't deal as gracefully with recursion as the C++ compiler does, it seems.

I also intend to implement a line intersection query, to gather cross-sections. Admittedly, this can be done more more easily by sampling along the line, then searching for the points, so I might not prioritize it, but there's something to be said for having an exact representation. The vtk version does implement this, so I can likely use it for some guidance -- it's surprisingly complicated compared to a box or point query though!

ChrisBarker-NOAA commented 4 years ago

@Huite wrote:

"But at first glance, it does seem like a huge refactor to me: there's almost not a single module that would remain untouched? There seems to be quite some shared functionality (e.g. time handling, of course IO)."

Overall:

Well, you may well have to touch most of them, but hopefully not a big change in most.

We tried to construct most of the gridded data structures and algorithms to use "numpy array -like" objects -- so hopefully xarrays will be able to be dropped in many places with little change.

time

As for time handling -- if xarray supports what we need, then we can swap it in for our old code.

IO

As far as IO is concerned, yes and no -- the IO really needs some refactoring anyway, and xarray provides a similar data model to netcdf4, so a direct port should be doable.

But IO was intended to be separated from the rest of it anyway, so we can either:

Add a new set of xarray IO modules and methods or Port the netcdf ones to xarray using it as an abstraction layer over netcdf (and others).

dependencies

Yes, that was a concern early on -- and one of the reasons we didn't build on xarray ion the first place. But now that the conda ecosystem has matured, I think depending on xarray is OK. And some of the more complicated dependencies (dask, etc) are optional -- testing now, the only thing "big" xarray pulls in is pandas -- and that's very well supported in conda and elsewhere.

So we're fine with that.

review / forking

No point in re-writing anything that's already in gridded anyway. So I think we could start an xarray branch, and let you have at it, and see where it goes. If you end up taking it in a direction that we don't think will work for us, then you can fork and go there. But I hope we'll be able to find an approach that will work for all of us, and then we can benefit from the collaboration.

Frankly, we really need some more collaboration on this project -- I've always had a vision for it that it really has not met. We started out by merging pyUGRID and pySGRID, which were not quite compatible, and then added the stuff we actually needed. But we never finished the whole package, as we only had the time to actually code the stuff we needed for our own use case. But I do think we've got useful framework in there, and some good features, so expanding the developer/user base would be great.

line intersection

Yeah, line intersection is bit tricky, but not horrible. related to that, Rob Hetland and I started a geometry package (https://github.com/NOAA-ORR-ERD/geometry_utils) -- the idea was to have basic computational geometry routines that worked with raw numpy arrays, rather than having to build up a whole system of geometric objects. I recall if line intersection is in there, but I can add it if need be.

Though we may want to extend that, or pull some stuff for gridded, to work with the grids data structure(s) -- i.e. that the lines, polygons, etc are defined in terms of indexes into a pints array that has the actual coordinates.

Huite commented 4 years ago

That sounds rather promising!

Thinking it over, I came at this with the wrong idea initially: my hope was that I'd be able to leverage the xarray Dataset as much as possible -- but I think I'm coming around to realizing it's fundamentally not the right abstraction. Using xarray for structured data, you can e.g. do .sel(x=slice(xmin, xmax) and it returns a consistent dataset, with all the variables that depends on x sliced. But this falls apart when they're only "weakly linked" in a UGRID data structure and the xarray dataset doesn't help that you that much in terms of doing the accounting. So indeed, I'd probably end up re-writing a lot.

I think you could use an xarray.Dataset to e.g. store the UGRID topology:

class UGrid(object):
    def __init__(self,
                 nodes=None,
                 node_lon=None,
                 node_lat=None,
                 faces=None,
                 edges=None,
                 boundaries=None,
                 face_face_connectivity=None,
                 face_edge_connectivity=None,
                 edge_coordinates=None,
                 face_coordinates=None,
                 boundary_coordinates=None,
                 data=None,
                 mesh_name="mesh",
    ):

It might be useful internally to automatically enforce a few things, e.g. shared dimensions on node_lon and node_lat, and a number of conveniences, but that's something to be seen. At the least it provides a pretty convenient way of reading and writing.

But yeah, I think the Python ecosystem has matured wonderfully over the past few years. Numba's really great too, you can easily implement those geometry routines now in "pure Python", no hassle with compilers or interfacing code, and you can even get people who don't really identify as technical specialist to contribute sometimes.

I really appreciate these discussions, by the way. I'll start some work on a fork as you suggested -- hopefully soon, but I still have some reports to write for some other projects (and I probably ought to enjoy the summer while it lasts).

rabernat commented 4 years ago

I'm glad to see this discussion happening. A few points from the xarray side you might want to consider.

Xarray has recently received funding from CZI for a bunch of feature improvements. Chief among this is flexible indexes. So if you don't like what .sel does by default, you will soon be able to implement your own index that has your desired behavior, using a celltree or whatever mechanism you can come up with.

I also wanted to suggest looking in xarray accessors. These allow third-party packages to extend xarray objects with their own namespace. This would allow you to write code like

import xarray as xr
import ugrid # registers an accessor on import
ds = xr.open_dataset(...)
ds.ugrid.do_something()

This could be a good way to plug other functionality into xarray. This is what the new cf-xarray package does.

ChrisBarker-NOAA commented 4 years ago

Thanks @rabernat.

I'm not sure that "flexible indexes" is the right abstraction -- maybe I'm being thrown off by the name, but when you are dealing with unstructured grids, "indexes" really not the point -- at all. In a nutshell, the way stuff is stored (the arrays, and indexes into those arrays) does not necessarily tie to the physical world.

The whole idea of gridded is to provide a higher level of abstraction - -where the user can work in world space (lat, lon, depth) directly without any reference or understanding of the internal arrays or indexes.

I see in the xarray docs that "composition is preferred over subclassing" -- which I think really applies here. That is, I think a Gridded Dataset can HAVE an xarray dataset, but I don't think it can BE an xarray Dataset.

Perhaps a gridded Variable can be closer to an xarray array, but I still think that's going to be a composition thing.

That not to say that we wouldn't share as much of the API as possible, or that users can't have access to the underlying xarray arrays (or dataset) when that makes sense, but that would be an explicit choice.

Anyway, it'll take some experimentation to see how this all works out.

rabernat commented 4 years ago

Yes, I think you're being thrown off by the name and assumptions about what "index" means. Indexing just means looking up data from the dataset using an arbitrary query.

ChrisBarker-NOAA commented 4 years ago

Ok then -- i'll keep an open mind :-)

benbovy commented 4 years ago

Just landed here from https://github.com/pydata/xarray/issues/4222.

To illustrate @rabernat's comment above, we've started exploring Xarray-based solutions for indexing unstructured grids in the xoak repository. There are some examples in this gist (at the end of the notebook):

https://nbviewer.jupyter.org/gist/benbovy/adcb7d3b14d22014444a2c57efb30696

It extends Xarray Dataset .sel() and .set_index() via an accessor, and it leverages Xarray's advanced indexing features. Eventually, this accessor won't be needed anymore when flexible indexes will be implemented in Xarray (note: I'll start working on this very soon - next month or so - so I'd be very interested to get in touch and hear your thoughts).

The idea with flexible indexes is that one or more coordinates can be used together to explicitly build an index of any kind (i.e., a class that needs to provide some common interface that has still to be defined), treated as 1st-class citizen in Xarray's data model. In principle, it would be possible to build a custom index from all ugrid dataset coordinates (nodes, faces, edges, connectivity...), which will be used for grid-aware, consistent indexing with Dataset.sel().

That said, I agree that, even with flexible indexes, plain Xarray datasets might not provide the best level of abstraction for all grids, as there are other grid-aware operations than just selecting some data on the grid and because Xarray accessors have their limits. At least, flexible indexes will help reducing maintenance burden at other levels of abstraction.

BTW, celltree looks very interesting!

Huite commented 4 years ago

Hi @benbovy,

That's all excellent news, I was thinking of starting something like xoak myself -- very happy to see you've already started!

There are some examples in this gist (at the end of the notebook)

Good to know about the accessor indexing, I was still missing a nice example like that.

note: I'll start working on this very soon - next month or so - so I'd be very interested to get in touch and hear your thoughts

Ah, this is wonderfully timed! And indeed, there's still some figuring out to do, I think we'll have to do some experimentation to see what sticks and what doesn't.

ChrisBarker-NOAA commented 4 years ago

I'm still wrapping my head around what the API would look like here -- anyone have an example of how this would be used? i.e. how to get a value of a variable at a given lat, lon and time?

Anyway, I hope that we'll be able to leverage the code / data model, etc already in gridded, rather than implementing it all over again.

Key here is that you need a data model / represents each type of grid -- and ideally the same one for all grid types, which is what gridded was designed to be.