Open djhoese opened 6 years ago
Hi @djhoese et al., thanks for staring the conversation! I've been looking at pyresample and there are a lot of similarities (and some really cool examples!). It seems that the main use case is to go from a grid in a projection/region/orientation to another grid. Is that right? In this sense, it makes sense that satpy uses xarray as an internal data format.
For Verde, the main use case is unevenly sampled data to grids. So xarray is mostly an output format for me. The accessor pattern is really cool but it only works if you're starting out with an xarray Dataset. We usually have pandas.DataFrame or numpy 1D arrays. This is why the inputs to fit
are coordinates and data separately. The same applies to having projection information etc. It would be cool to have Verde include that in the output grid but we get input data in all kinds of formats (usually text files with bad headers, or none at all).
I've been focusing more on the interface and worrying about efficiency later. That said, I started using pykdtree and it's really fast! Sadly, that's not what we use for interpolation. The gridders we have are more like radial basis functions so our bottleneck is building the sensitivity/jacobian/feature matrix and then solving the least squares problem. Right now, I'm using dumb numpy operations to build the matrix and scikit-learn to do the fitting. This is really not ideal but I've been bitten by premature optimization before :wink:
Verde still has some way to go in terms of features. Right now, I have the bare minimum. This all started because I'm working on a vector interpolation method and needed some utilities and a basis for comparison. Much of what is currently here has been in GMT for a long time (greenspline and blockmean). I reimplemented them because I wanted to tweak things without having to dive into C code and use bash. In the end, I like the verde interface better and plan on adding different Green's functions for spherical gridding, etc.
One thing that I've been wanting and pyresample already has is filtering. I sometimes need to downsample gridded data and running a lowpass filter first is crucial. I have some other projects that use gridded data more often and will definitely have use cases for pyresample.
It would be really great to have a similar API at least. I'm not opposed to merging if there is significant overlap. The Verde interface is not terribly complex (see BaseGridder). A class that implements fit
, predict
, and filter
can be used with all other classes (through Chain
). For example, a spatial filtering class could have the following interface:
class GaussianFilter():
def __init__(self, width):
self.width = width
def filter(self, coordinates, data, weights=None):
"""
Run the filter on the data and output new coordinates and data.
Can assume that data is 2D and raise exception if not.
"""
...
It could then be used to filter some gridded data before resampling in a pipeline:
data = xarray.open_dataset(...)
chain = vd.Chain([
("filter", GaussianFilter(width=100e3)),
("resampler", Resampler(area_def)),
])
# Fit will filter the data before passing it to Resampler
chain.fit((data.longitude, data.latitude), data.scalars)
# The filter doesn't impact the data prediction so it's not called when gridding
new_grid = chain.grid(region=..., spacing=...)
Is this something that would be interesting to you?
I feel like @jrleeman and @dopplershift might want to pitch in as well.
Then again, the filter is something that would be much better implemented as an accessor.
We have two main use cases for pyresample (via satpy):
SwathDefinition
as the source area and AreaDefinition
as the target area. A SwathDefinition
is basically a container for a longitude array and a latitude array.AreaDefinition
to AreaDefinition
.Both are handled by the same logic when it comes to using a KDTree. We typically deal with 2D or 3D data (RGB images), but it should work for any dimensionality. Our newest very rough interface for pyresample that I don't think we've even documented on the sphinx pages comes down to:
resampler = KDTreeResampler(source_area, target_area)
resampled_data = resampler.resample(my_data, **kwargs)
The nice thing about this is that we can cache the resampling KDTree (or other algorithm's intermediate results) in the resampler object or on disk because the source_area is typically shared between multiple DataArrays (multiple channels on a satellite instrument). And with the magic of dask we also get some nice optimizations and shared processing by computing multiple results at the same time. NOTE: Not everything in pyresample is dask friendly.
One thing I like about your grid interface is that the user doesn't have to worry about projections or metered projection coordinates if they don't want to, just the spacing and the lon/lat region. We have utilities that help for accomplishing the same thing, but since we've only recently (last year or two) started working with xarray we haven't implemented everything assuming that the data is easily provided/stored in a single object. This is why I got interested in the idea of a geoxarray library (https://github.com/pydata/xarray/issues/2288).
Other useful features we've been working on in satpy are cropping (filtering I guess) an existing AreaDefinition by a metered or lon/lat bounding box or by another AreaDefinition. This lets us more easily only resample the data that will fall in the destination area.
Regarding your bad file formats that you have to deal with, that is one of the reasons we made satpy because we're in a similar situation. File formats that aren't easy to read or have some special way of being handled to get anything useful. Your pandas DataFrames can be described as xarray Dataset's can't they? Is there any reason the conversion wouldn't work?
Hi @djhoese i didn't have much time to think about this but I'm coming back to Verde now.
Your pandas DataFrames can be described as xarray Dataset's can't they? Is there any reason the conversion wouldn't work?
They probably could. Though it would require a Dataset with a non-coordinate index and at that point, there isn't really a lot of advantage over pandas/numpy.
I'm hesitant to force users to use xarrays for things that they are probably used to doing in numpy/pandas. Right now, I don't have any classes/functions that work solely on xarray. Pretty much all of them assume some sort of array-like object (hence the coordinates, data
style). The main reason is that I deal with a lot of non-gridded data and sometimes we want to do operations without having gridding data as well. Something on my TODO list is to include a method for computing derivatives on arbitrary points.
Regarding future plans, I'm trying to restrain myself from putting too much into a single package. That was my pattern for a while and it really didn't turn out to be a good idea. That's actually the reason Verde exists. So I want to keep Verde solely focused of Green's functions based interpolation and the minimum preprocessing needed (mainly BlockReduce
). There are of algorithms out there for this and I have only 2 implemented so far. There are also more technical challenges for running this on medium to large datasets. So I'll be focusing on that.
BTW, thanks for this discussion! It's really helping me think about the scope of the project. And now I know how much cool stuff is in pyresample and satpy :)
Without understanding the science and math/algorithms involved in your work, I think I get what you're saying about using xarray and totally understand the issues users might have with it.
There are of algorithms out there for this and I have only 2 implemented so far. There are also more technical challenges for running this on medium to large datasets. So I'll be focusing on that.
Do you plan on trying dask or do the algorithms need to be recoded? Or both?
Regarding future plans, I'm trying to restrain myself from putting too much into a single package.
That is probably something that satpy will need in the next major refactor after we get a 1.0 release out (@mraspaud @pnuu @adybbroe). We have a lot of parts that could be considered separate but right now they are too tightly coupled in ways I'm not sure are easy to decouple right now. Like needing a certain piece of metadata (ex. satellite name) to exist in the DataArray for the data to be best handled by other code. It is something we have discussed before but it isn't the top thing on our priority list.
I see on that blog post that you have a "geometric" library that is in the early stages of development. That could be another thing that either pyresample could help with or we could both hand off to the rasterio library (depending on your feelings on adding libgdal as a dependency for users). What kind of operations do you end up doing with these shapes?
Maybe both of our groups need to evolve a little more before real collaboration can be done easily? There is also the geoxarray project that could start this process, but if verde wouldn't benefit from xarray extension libraries then maybe not 😉. I think verde is new enough and pyresample is in need of a refactor enough that maybe we should revisit combining interfaces after we see what best serves our users. That could be another route for collaboration. Bare minimum, we'll keep each other in mind.
Without understanding the science and math/algorithms involved in your work
It's good to know that I'm not the only one :)
Do you plan on trying dask or do the algorithms need to be recoded? Or both?
The main bottle neck right now is solving a linear regression problem. I use scikit-learn for that right now. I tried using the linear algebra routines in dask but it's slower for datasets that fit in memory. So the dask solution would only be useful if I'm running out of memory. I think there is a better way of doing this by splitting the problem into windows and gridding each individually. I'm still investigating that but it would be a way to use dask without using the linear algebra parts.
I see on that blog post that you have a "geometric" library that is in the early stages of development.
Yeah, that ground to halt lately and I haven't been able to play with it much. Plus, I'll have to change the name because someone else had a better claim to the PyPI namespace so I gave it up.
What kind of operations do you end up doing with these shapes?
I don't know I'll keep working on that because I don't actually need any operations on the shapes. They are pretty much just containers and some utilities for creating collections (meshes) and plotting. I think that was a bit of over-engineering on my part. The same could probably be done with dictionaries or namedtuples without all the boilerplate. I'm waiting until I get a change to work on the code that used those classes to see if there is a real need for a geometry package.
Maybe both of our groups need to evolve a little more before real collaboration can be done easily?
Yeah, that is definitely the case for Verde. I'm probably the only user right now and even I haven't used it heavily :wink: But I'm keeping an eye on geoxarray. That's something that looks very handy!
Bare minimum, we'll keep each other in mind.
Oh yeah, I'm keeping this issue open and I'm watching pyresample to see what you are up to. You might be interested in adding pyresample and pysat to this list: https://github.com/softwareunderground/awesome-open-geoscience#geospatial
I am one of the maintainers of the pyresample project (https://github.com/pytroll/pyresample) along with @mraspaud @adybbroe @pnuu. We do some very similar operations to what you have in verde, but with different interfaces and sometimes different purposes. I think our two projects could benefit from each other.
Pyresample/SatPy
Pyresample has been around for a long time and is used by many meteorological institutes/organizations via the satpy library (https://github.com/pytroll/satpy) and mpop before that. I also use satpy in my command line tools under the polar2grid and geo2grid project and joined the PyTroll/SatPy team after attending conferences and realizing that we were solving exactly the same problems.
Pyresample and satpy now use xarray and dask for all of their operations, but pyresample is in need of a new interface now that xarray accessors are more popular and the ease of attaching geolocation information to an xarray object (see https://github.com/pydata/xarray/issues/2288). Pyresample offers (or will very soon) three different resampling algorithms: nearest neighbor using pykdtree, elliptical weighted averaging for scan-based satellite instrument data, and bilinear interpolation using pykdtree for neighbor calculations. The old pyresample API also includes the ability to use your own weighting functions, but we haven't migrated that to work with the xarray/dask functionality.
The use cases that pyresample were originally used for and what people were used to doing involved having predefined 500m-2km grids in various projections. These grids can also be dependent on the visualization tool used to view the satellite data where only certain projections can be used or certain resolutions are too dense for satellite raster images. This is my opinion/experience, the other pyresample developers may see differently. Anyway, a lot of pyresample's interfaces assume you have a predefined grid ("AreaDefinition") that you want to resample your data to. It is coming up more and more that we need an easier interface for making these definitions from other existing definitions like lon/lat arrays or adjustments to existing AreaDefinitions (change resolution). While this is all possible, it isn't always the easiest to do or access or document.
Verde
I spoke with you @leouieda at SciPy 2018 and on gitter which lead to #92 where verde now uses pykdtree if it is available for better performance. You also mentioned in your lightning talk I believe that you'd like to switch to dask where possible but haven't done so yet. This is something pyresample's experience could help with since dask-ifying a kdtree is not easy, but I think we've done the best we can without rewriting the C code.
You are developing some really nice interfaces in verde from what I've seen and make certain use cases extremely easy. Like the last paragraph I described for pyresample above. Pyresample could benefit from interfaces like these I think or both projects could benefit by adding these interfaces to xarray accessors.
Collaboration
One library, two libraries, one library that depends on another, two libraries that depend on a third, or just similar interfaces...I'm not sure what would be best. I do think that we could benefit from working together. This issue isn't requesting that either project be absorbed by the other, but to make each aware of the other.
@leouieda Where do you see verde's features overlapping with pyresample? How many more features do you still want to add to verde (assuming verde is young-ish and seeing how many PRs you are making every day)?
How far do you see verde's feature set going? What else do you want it to do? Does it/will it do more than gridding and the related utilities?