JiaweiZhuang / xESMF

Universal Regridder for Geospatial Data
http://xesmf.readthedocs.io/
MIT License

Regridding API design #9

Closed · JiaweiZhuang closed this 6 years ago

JiaweiZhuang commented 6 years ago

I was looking at Pangeo's Regridding Design Document, which suggests an interface like da.remap.remap_like(da_target, how='bilinear'). It looks clean but doesn't match the "two-step" procedure for most regridding algorithms.

Regridding weight calculation and weight application should be done separately, as reviewed in #2. The weights depend only on the source and target grids, not on the input data, so as long as the grids don't change, users only need to calculate the weights once and can then apply them to any data. Applying weights is often orders of magnitude faster than calculating them (see the timing in #6 for an example), so separating the two steps has a huge impact on performance, especially when the task is to regrid a lot of data between a fixed pair of grids.

xESMF's current design, dr_out = xe.regrid(ds_in, ds_out, dr_in), re-computes the weights on every call. Using two steps should boost performance by 10-100x.


Thus I am thinking about an sklearn-like API, which is also two-step:

1) In sklearn you train a model by model.fit(x_train, y_train)

2) then make new predictions by y_pred = model.predict(x_test)

xESMF can do similar things:

1) Calculate regridding weights by weights = xe.compute_weights(ds_in, ds_out, method='bilinear'), where ds_in and ds_out are xarray Datasets containing the input and output grid information.

2) Then apply the weights to data by dr_out = weights.apply(dr_in), where dr_in is the input DataArray.

3) Because ESMPy writes the weights to a file, the next time you can read them from the file instead of computing them again: weights = xe.read_weights("weights.nc"). The I/O time is negligible compared to re-computing the weights.

Here weights is a tiny class that holds the weights and knows how to apply them to data. Alternatively, weights could simply be an xarray Dataset read by raw_weights = xr.open_dataset("weights.nc"); step 2 would then become dr_out = xe.apply_weights(raw_weights, dr_in). I prefer the first approach because it feels more like sklearn, and people might find it more familiar.
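Concretely, the two-step usage would look something like this (a sketch using the names proposed above, assuming ds_in, ds_out, and dr_in already exist):

    import xesmf as xe

    # Step 1 (slow, once per grid pair): compute the regridding weights.
    weights = xe.compute_weights(ds_in, ds_out, method='bilinear')

    # Step 2 (fast, repeatable): apply the same weights to any DataArray
    # defined on the source grid.
    dr_out = weights.apply(dr_in)

    # In a later session, read the cached weight file instead of recomputing.
    weights = xe.read_weights("weights.nc")
    dr_out2 = weights.apply(dr_in)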

Any comments are welcome; otherwise I'll proceed this way. @rabernat @jhamman @spencerahill. Please also @ anyone who is interested in regridding with xarray.

JiaweiZhuang commented 6 years ago

Another issue with da.remap.remap_like(da_target, how='bilinear') is passing cell boundaries for conservative regridding (pydata/xarray#1475). Using weights = xe.compute_weights(ds_in, ds_out, method='bilinear') avoids the problem because a Dataset can hold size-N+1 variables (cell boundaries) alongside the size-N cell centers.
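For example (the variable names lon and lon_b are just illustrative):

    import numpy as np
    import xarray as xr

    # Hypothetical 1D example: N cell centers plus N+1 cell boundaries can
    # coexist in one Dataset because they sit on different dimensions.
    lon = np.arange(0.5, 360.0, 1.0)    # N = 360 cell centers
    lon_b = np.arange(0.0, 360.5, 1.0)  # N+1 = 361 cell boundaries
    ds_in = xr.Dataset({'lon': ('x', lon), 'lon_b': ('x_b', lon_b)})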

rabernat commented 6 years ago

You are right on with this idea. Having some sort of Regridder class definitely seems like the right API. This is a great suggestion.

jhamman commented 6 years ago

@JiaweiZhuang - I think you want to do both: a Regridder class for users who want the two-step regridding process, and a high-level convenience function/method that does both steps in one pass.

Another issue with da.remap.remap_like(da_target, how='bilinear') is passing cell boundaries for conservative regridding (pydata/xarray#1475)

This is a good point although we may be able to get around it by adding a kwarg or two to remap_like.

All in all, I think you are on the right track.

Because ESMPy writes the weights to a file, the next time you can read it from file instead of computing it again.

Does it always do this or is this a feature that can be controlled by xESMF?

rabernat commented 6 years ago

Another issue with da.remap.remap_like(da_target, how='bilinear') is passing cell boundaries for conservative regridding (pydata/xarray#1475)

On this topic, we are discussing how to best implement cell distance / area / volume data (generically called "grid metrics") in xgcm/xgcm#81. One possibility is that xgcm will take care of those geometric questions. In this case, a regridding package could map between xgcm grids, rather than xarray datasets.

darothen commented 6 years ago

I think having a step that can compute and cache the weights is a great idea, and very much like the sklearn-style approach. There are a few standard formats that I'm aware of for caching weights - in particular, SCRIP uses a format that both NCL and CDO can read in turn.

Now, if everything is done in the framework of xgcm, and there's some notion of "standard" grids for a set of known models, then it may be possible to create a data server/archive which caches the weights for well-known re-gridding operations. Think about how Cartopy has the nifty utility to grab shapefiles from Natural Earth... what if, as part of pangeo-data, there were a bucket on AWS or GCP that catalogued and archived these re-gridding weights and could be directly downloaded by a client? The whole catalogue could be entirely automated, or could even spin up a VM on AWS or GCP to create the weights whenever a request arrives for an unknown re-gridding operation.

JiaweiZhuang commented 6 years ago

Does it always do this or is this a feature that can be controlled by xESMF?

ESMPy writes the weights directly to disk, so behind weights = xe.compute_weights(ds_in, ds_out, method='bilinear') several steps actually happen:

1) Use ESMPy to write the weights into a NetCDF file

2) Open that NetCDF file to get weights as numpy arrays

3) Build the Regridder object, including converting the weights data to a sparse matrix type.

The entire workflow is available in the notebooks in #6. I don't know if it is possible to directly retrieve the weights as numpy arrays without creating a new NetCDF file (@bekozi?). Because ESMPy calls Fortran to compute the weights, I guess simply dumping them to a file is easier than returning them as in-memory numpy arrays...
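For reference, steps 2-3 could look roughly like this, assuming the weight file follows the usual ESMF/SCRIP layout (variables col, row, S along dimension n_s, with 1-based Fortran indices; the function name is made up):

    import xarray as xr
    import scipy.sparse as sps

    def read_weights(filename, n_in, n_out):
        # 'col' holds source cell indices, 'row' destination cell indices,
        # and 'S' the weights, all 1-based in the ESMF/SCRIP convention.
        ds = xr.open_dataset(filename)
        matrix = sps.coo_matrix(
            (ds['S'].values,
             (ds['row'].values - 1, ds['col'].values - 1)),  # to 0-based
            shape=(n_out, n_in))
        return matrix.tocsr()  # CSR is fast for repeated matrix-vector products

    # Applying the weights to a 2D field is then a single sparse multiply:
    # data_out = (weights @ data_in.ravel()).reshape(shape_out)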

Having the weights available on disk is helpful in general, because the next time a user is very likely to remap data between the same pair of source and target grids. Instead of spending >10s recomputing the weights, it is much better to read a ~10MB file. The idea is similar to caching the regridder in iris.

I do have some concerns about how to keep track of those weight files, though. My plan is to follow the weight-file naming in the NASA GEOS-5 model, so the default weight file names would be something like

    Bilinear_600x400_400x300.nc
    Conservative_600x400_400x300.nc

where the first word is the regridding algorithm, 600x400 is the input grid size, and 400x300 is the output grid size.
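A hypothetical helper for generating such names:

    def default_weight_filename(method, shape_in, shape_out):
        # Follows the GEOS-5-style naming scheme above, e.g.
        # ('Bilinear', (600, 400), (400, 300)) -> 'Bilinear_600x400_400x300.nc'
        return '{}_{}x{}_{}x{}.nc'.format(method, *shape_in, *shape_out)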

The weight file is created the first time a user calls weights = xe.compute_weights(...). The next time the same pair of grids is used, xe.compute_weights(...) will default to xe.read_weights("weights.nc") and print some message/warning. The weight file could be overwritten by something like weights = xe.compute_weights(..., clobber=True)
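In pseudocode, the intended caching behavior would be something like this (compute_weights_with_esmpy is a stand-in for the actual ESMPy call that writes the weight file; read_weights and default_weight_filename are the sketches above):

    import os

    def compute_weights(ds_in, ds_out, method='bilinear',
                        filename=None, clobber=False):
        if filename is None:
            filename = default_weight_filename(method, ds_in['lon'].shape,
                                               ds_out['lon'].shape)
        if os.path.exists(filename) and not clobber:
            print('Reusing existing weight file: {}'.format(filename))
        else:
            # Stand-in for the real ESMPy computation that writes `filename`.
            compute_weights_with_esmpy(ds_in, ds_out, method, filename)
        return read_weights(filename, ds_in['lon'].size, ds_out['lon'].size)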

A potential issue is naming conflicts: two grids can have the same size but cover different regions. I believe that can be solved by checking coordinate values internally and by allowing users to specify the file name via xe.compute_weights(..., filename="custom_name.nc").

JiaweiZhuang commented 6 years ago

I think you want to do both. A Regridder class for users who want to do the two-step regridding process and a high-level convenience function/method that does both steps in one pass.

Yes, I agree that a one-step convenience function is quite useful for occasional users. But the standard and recommended API would be two-step, since that's crucial for performance.
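In other words, the one-step function could just be a thin wrapper around the two-step API, something like:

    import xesmf as xe

    def regrid(ds_in, ds_out, dr_in, method='bilinear'):
        # One-step convenience wrapper for occasional users. It recomputes
        # (or re-reads) the weights on every call, so the two-step API is
        # still preferred inside loops over many variables.
        weights = xe.compute_weights(ds_in, ds_out, method=method)
        return weights.apply(dr_in)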

JiaweiZhuang commented 6 years ago

There are a few standard formats that I'm aware of for caching weights - in particular, SCRIP uses a format that both NCL and CDO are able to take in turn.

ESMPy does dump a SCRIP-like format (#4), although currently the metadata is incomplete (only weights, no grid information). I believe @rokuingh is working on that.

what if as part of pangeo-data there was a bucket on EC2 or GCP that catalogued and archived these re-gridding weights and could be directly downloaded by a client? The whole catalogue could be entirely automated, or could even receive requests and create the re-gridding weights on a VM that AWS or GCP spins up if the archive receives a request for an unknown re-gridding operation.

This is exactly how we provide weight files to GEOS-Chem HP users. But that's because our current regridding software is not easy enough for them to use, not because it's more efficient😅. Given that a ~10MB weight file takes ~10s to compute and also ~10s to download (assuming 1MB/s bandwidth), it's hard to see a speed gain from retrieving weights remotely. That said, there is a benefit if the entire workflow is done on EC2 or GCP.

mrocklin commented 6 years ago

Hi, I have a couple comments/questions:

  1. From a distributed context I'm somewhat concerned about relying on on-disk NetCDF files. Using these as an on-disk cache and performance optimization makes sense to me, but from an API perspective it might be nice for the semantic return value to be an in-memory dataset. This would help when applying the same regridding operations across different machines that don't share a file system.
  2. I'm curious what the Fortran code is doing and why it takes 10s. Does anyone have an understanding of what regridding code would look like if implemented in NumPy or Cython/Numba?

I'm somewhat new to this domain, so please forgive any ignorance on my part.

JiaweiZhuang commented 6 years ago

I'm curious what the Fortran code is doing and why it takes 10s. Does anyone have an understanding of what regridding code would look like if implemented in NumPy or Cython/Numba?

It is computing the weight matrix, i.e. the regridding operation itself. It involves searching for and computing the overlapping areas of the two grids, handling various edge cases, etc. A nice overview of the regridding operation is available in the SCRIP user guide, Chapters 3-4 (SCRIP is a legacy regridding package, older than ESMF). Rewriting such code from scratch is beyond my capability and might take forever...

jhamman commented 6 years ago

from an API perspective it might be nice to have the semantic return value be an in-memory dataset. This might help if trying to apply these same regridding operations across different machines that don't share the same file system.

Exactly. I think in the short term, using a temporary NetCDF file will be okay, provided we can establish a road map for doing away with the weights file. That may just require working with the ESMF developers to establish a Python API for returning arrays of weights.

JiaweiZhuang commented 6 years ago

The new API is implemented in the just-released v0.1.1. Let's use #11 for the weight file issue.