NOAA-ORR-ERD / gridded

A single API for accessing / working with gridded model results on multiple grid types
https://noaa-orr-erd.github.io/gridded/index.html
The Unlicense
64 stars 14 forks source link

deprecate xgcm in favor of gridded? #20

Open rabernat opened 6 years ago

rabernat commented 6 years ago

I just discovered this package. It looks fantastic.

It looks like you are solving the same problem I am trying to solve in xgcm, but in a more general way: http://xgcm.readthedocs.io/en/latest/ https://github.com/xgcm/xgcm

I am using this to process MITgcm outputs. I would be happy to give up on my package and collaborate on this one instead.

However, there are a few issues that stand in the way of me from doing so..

  1. I need a package that consumes and produces xarray datasets, not files. It looks like this package interacts directly with netCDF files, rather than xarray datasets. I would prefer to separate the file I/O problem from the grid operations. Xarray provides the appropriate data structures do do this--there is nothing you can put in a netCDF file that can't be represented with an xarray DataSet object.
  2. On a related point, it look like gridded operates eagerly on numpy arrays, rather than lazily on dask arrays. This is another dealbreaker for me, since many of my model outputs are in the TB range. In the pangeo project we are putting a lot of effort into building architecture that can scale xarray / dask-based analysis on HPC and cloud platforms. It would be great to be able to use gridded in this context.

Curious to hear your thoughts on these points.

rsignell-usgs commented 6 years ago

For a quick example of the power of the xarray/dask approach in pangeo, check out this 3 minute video

ChrisBarker-NOAA commented 6 years ago

Welcome!

And I agree that gridded is kind of more-flexible xgm.

As to your points: We haven't adopted dask (or xarray) in our org, because we end up using a lot of C cocd that we need to pass arrays to, and this is easiest with straight numpy. That, and momentum...

And I talked with Stephan Hoyer a bit about this couple years ago, and he was totally uninterested in adding unstructured grid support to xarray, though the other way around is of course possible.

But it's been a design philosophy from the beginning to:

That being said, most of the development of the code as been to support our use cases, which is netcdf and numpy arrays. So:

Most of the assumptions about netcdf should be in the netcdf reading/writing code though.

On to your specifics:

I need a package that consumes and produces xarray datasets, not files.

where do these xarray datasets come from, if not files? But anyway, that should be totally doable, you're "just" going to need to write the code :-) -- i.e. signature for:

gridded.Dataset.__init__

is:

ncfile=None, grid=None, variables=None, grid_topology=None

this is a semi-overloaded init -- it takes either a netcdf file (name, url, or open netcdf4Dataset), OR the grid and variables objects.

So you could either overload it some more and add an xarray dataset tot hat list, or,(probably a better API), add an alternate constructor:

Gridded.from_xarray(the xarray dataset)

And the Grid and Variable objects can all be created from array-like objects, they don't assume netcdf.

We may well want to do some refactoring for the UGRID and SGRID standard reading code, so that we don't totally duplicate that logic.

It looks like this package interacts directly with netCDF files, rather than xarray datasets. I would prefer to separate the file I/O problem from the grid operations.

That is already separate -- the Grid and Variable (and Depth and Time) objects all get references to array-like objects -- which should work with xarray DataArrays, dask arrays, etc. not well tested, and there will probably be issues to resolve, but the structure should be there.

Note that a gridded.Variable keeps its own copies of attributes, etc -- it can be created from a netcdf Variable, but it is independent, and can be created with raw data or, theoretical, and other object with a array-like interface and attributes.

On a related point, it look like gridded operates eagerly on numpy arrays, rather than lazily on dask arrays.

As I said, we haven't used dask yet, but have tried to keep the array evaluation as lazy as we can.

I would be wary of requiring dask, but if there is a way to make it optional, it would be great to have it as an option.

The grid topology itself kinda has to be loaded into memory all at once, but we don't need to load in Variable data until it's needed.

what kind of operations do you want to do with dask? could it interpolation be parallelized? I don't know. But I'd love to hear your ideas!

Also -- the goal of gridded is to provide a unified API -- if there are multiple implementations under the hood, that would be fine, too :-)

-CHB

jay-hennen commented 6 years ago

Hi rabernat,

Glad to see your interest and optimism about this package! Looks like Chris beat me to the comment button and provided more info to boot, but here's my 2 cents anyway

About your issues:

  1. Currently development focus is driven by use case, and we've primarily been using netCDF. However, although xarray/dask integration has not been a primary goal, we have tried (and I think largely succeeded) to avoid incompatibility with it. So although there are no 'convenience' APIs available yet for working with xarray/dask (such as the .from_netCDF functions, or the file parsing stuff), xarray/dask data structures should be usable anywhere a netCDF or numpy structure is used.

  2. I think it's fair to say that Gridded as a whole does and should not operate eagerly. It is lazy when appropriate, and minimum-eager when appropriate. The 'data representative objects' in the package (Grid, Variable, Dataset, etc) should only require metadata to be instantiated. For example, if you provide a URL to an OpenDAP server as the filename to the Variable.from_netCDF function, it uses the netCDF4 library to open the link and manage requests and memoization of data. At this point only metadata has been loaded from the remote server. However, if you wish to interpolate the data to find the values at a bunch of points, then only the minimum necessary data to compute the results would be requested and eager execution occurs to get the results. For example, if you want to interpolate data to 10,000 points, but they all fit within a dozen grid cells, it will only request the minimum data slice to encompass the data at those grid cells.

As Chris mentioned, the API between components is the important part of this system. For example: Grid_R, Grid_S, and Grid_U all represent different types of grid (rectangular, staggered/curvilinear, and unstructured), and compute information optimally for that specific grid type. However, they all share the same external API, so they can be used interchangeably.

rabernat commented 6 years ago

Thanks for these detailed replies. There is a lot of useful information here.

I'll respond to a few point and then spend some more time digesting what you have said

where do these xarray datasets come from, if not files?

  1. There are several weird binary GCM output formats for which there are now xarray wrappers (e.g. gcpy for geos-chem, xmitgcm for MITgcm)
  2. We are developing a new backend for xarray based on zarr which is ideal for storing data in cloud storage. There are no files, just an object store.

... I talked with Stephan Hoyer a bit about this couple years ago, and he was totally uninterested in adding unstructured grid support to xarray

I would interpret this differently. xarray doesn't support either structured or unstructured grids. It supports arrays with labels and dimensions. It simply has no notion of a grid cell whatsoever. xgcm is meant to extend the data model of xarray to structured, logically rectangular GCM grids.

xarray/dask data structures should be usable anywhere a netCDF or numpy structure is used

I see what you mean... I could suck the .data attributes out of my xarray variables and feed them to gridded. However, this would eliminate the most valuable and powerful reason I like to work with a xarray-aware packages: the persistence of labels (indexes and coordinates) together with the data. The use of pandas indexing objects makes selecting and plotting data very easy and immune to common errors. I'm not sure if this is something that gridded supports or not--it's hard to tell from the tutorials. It would be very hard for me to give this up and go back to working with unlabeled raw numpy data.

Overall, I love what you guys are doing here. I especially like the idea of a "universal" API for both structured and unstructured finite volume grids. When I read through the API documentation, e.g. https://noaa-orr-erd.github.io/gridded/gridded.html#module-gridded.variable I can't escape the impression that you are essentially re-implementing large parts of xarray. It would be great to find a way for these projects to work more closely together.

jay-hennen commented 6 years ago

I think what differs us from xarray is that when you work with a Variable, the intent is that it operates in the context provided by the Grid. Say we are working with time+2D temperature data Temp(T, X, Y). If you want to find the average temperature of the data, or max/min or other aggregate information, you don't need gridded to get that, just work with the data directly. Those are operations that are Grid-independent; whatever spatial orientation the data is meant to occupy doesn't matter.

However, if you ask a question that depends on the spatial organization of the data, such as what is the temperature at a specific time and place, you need to get the Grid involved, and that's where Variable comes into play. It organizes interpolation in the time dimension and grid dimension, while passing off the actual intermediate interpolation work to the Time and Grid objects respectively. Xarray as far as I'm aware can't help you in this area; you're always stuck in a rigid regular grid.

ChrisBarker-NOAA commented 6 years ago

Adding to what @jay-hennen wrote:

A gridded.Variable is a different abstraction than an xarray variable. As you say:

. xarray doesn't support either structured or unstructured grids. It supports arrays with labels and dimensions. It simply has no notion of a grid cell whatsoever

Exactly -- a gridded.Variable is NOT an array, and does not have dimensions, but rather a Grid (and maybe Time and Depth object, which is really another kind of grid). If the grid is regular, then the underlying data is in a rectangular array, and dimensions map to axis of the array, but that's a special case. In fact, we wrote gridded for UGrids and SGRids first, and only added regular grids later. So a Variable tied to a regular grid, does, in fact, look a lot like re-writing xarray.

That being said, under the hood, or course, the actual data is stored in arrays, and those arrays can be any array-like object. And xarray does support a lot of nifty features, particular for regular grids. So perhaps it would make sense to leverage that. The question is how?

So far, we only support creating Gridded.Datasets from netcdf files or raw data. So, to support your use cases, there are essentially two options:

1) write data readers for your other file formats -- seems like wasted work

2) write a data loader for xarray Datasets. from the xarray docs:

"A dataset resembles an in-memory representation of a NetCDF file, and consists of variables, coordinates and attributes which together form a self describing dataset."

Which, by the way, is why I thought xarray was pretty much a wrapper around a netcdf file :-)

But apparently, like gridded, xarray supports a Dataset abstraction that isn't necessarily netcdf. But it sounds like it maps to the netcdf Data model, so it would actually be pretty easy (if tedious) to adapt the net-cdf reading code to xarray.

3) update gridded to require xarray -- we could then adapt the netcdf-reading code to xarray, while still getting full netcdf support. And also take advantage of other features of xarray without having to re-write that. Perhaps we should have done that in the first place.

Frankly, I'm not sure what xarray offers that is helpful enough that the dependency is worth it -- but I haven't given it more than a superficial review either. And the whole dependency thing is less of a big deal than it was few years ago.

Looking again really quickly -- an xarray Dataset has a bunch of data-analysis methods that are all very wedded to regular grids, On the other hand, we could provide a similar API for gridded, and pass off the regular grid functionality to xarray, while essentially re-implementing it for other grid types.

I suppose we should have looked harder at xarray's API (and Iris') before building gridded, but we did look some -- and it it was so wedded to the whole concept of the grid mapping directly to arrays that it just didn't seem useful.

So: It would be great if you would dig in a little deeper and help us decide what's worth doing.

I would think that (2) above would be a good way to start.

ChrisBarker-NOAA commented 6 years ago

One more note:

I see what you mean... I could suck the .data attributes out of my xarray variables and feed them to gridded. However, this would eliminate the most valuable and powerful reason I like to work with a xarray-aware packages: the persistence of labels (indexes and coordinates) together with the data. The use of pandas indexing objects makes selecting and plotting data very easy and immune to common errors. I'm not sure if this is something that gridded supports or not--it's hard to tell from the tutorials. It would be very hard for me to give this up and go back to working with unlabeled raw numpy data.

gridded is all about NOT working with unlabeled raw numpy data!

That's the whole point of it :-) -- but the limitation of Pandas and xarray is that it assumes that the indexing of the arrays maps directly to a "physical space" coordinate -- could be latitude, longitude, time, depth, whatever, but with non-regular grids, that mapping is far more complex -- so labels are simply not enough.

That doesn't mean that gridded has all the features of xarray (Or pandas) but that's because we've only implemented what we've needed so far.

Again -- it may make sense to borrow, rather than re-implement the parts of xarray that are useful to gridded, but we definitely had good reason to write build it in the first place :-)

rabernat commented 6 years ago

Thanks for the clarifications guys! Clearly there is a lot more to gridded than I realized on my first pass through the documentation.

At the same time, there are a few places your characterization of xarray could also use some refinement:

the limitation of Pandas and xarray is that it assumes that the indexing of the arrays maps directly to a "physical space" coordinate -- could be latitude, longitude, time, depth, whatever, but with non-regular grids, that mapping is far more complex -- so labels are simply not enough.

There is no such assumption built in to xarray or pandas. Have a look at this example from the documentation: http://xarray.pydata.org/en/stable/examples/multidimensional-coords.html Hundreds of people are currently using xarray every day to interact with such datasets.

This has been a great interaction. Thanks for taking all the time to explain these details. I'm now going to play around with gridded a bit and see how it works for my typical use cases.