Open-EO / openeo-python-client

Python client API for OpenEO
https://open-eo.github.io/openeo-python-client/
Apache License 2.0

Idea: Xarray interface #334

Open benbovy opened 1 year ago

benbovy commented 1 year ago

I have stumbled upon the Open-EO project a few times and find it very interesting. As an Xarray developer, I've been wondering whether it would benefit from an Xarray interface.

I see that Xarray is already used here, but as far as I understand it is for defining user-defined functions (UDFs). My suggestion is rather to interact with Open-EO directly via the Xarray API, which may be complementary to using Xarray for UDFs.

New Xarray developments are going towards very flexible containers, with the recent addition of IO backends, alternative array backends (cupy, sparse, pytorch...), alternative parallel execution backends, flexible indexes (https://github.com/pydata/xarray/discussions/7041, https://github.com/pydata/xarray/projects/1), and accessors.

Leveraging Xarray's flexibility, I can imagine something like this:

import openeo
import xarray as xr

connection = openeo.connect("https://earthengine.openeo.org")

ds = xr.open_dataset(
    connection,
    engine="openeo",
    collection="COPERNICUS/S1_GRD",
    spatial_extent={"west": 16.06, "south": 48.06, "east": 16.65, "north": 48.35},
    temporal_extent=["2017-03-01", "2017-06-01"],
    bands=["VV"],
)

# internally calls DataCube.filter_temporal() via a custom Xarray Index
# attached to dataset and returns a new xarray.Dataset
ds_march = ds.sel(time=slice("2017-03-01", "2017-04-01"))

# internally calls DataCube.mean_time()
ds_mean_march = ds_march.mean("time")

# sends the processing job, waits for its execution and downloads
# the result into a new xarray.Dataset
ds_result = ds_mean_march.compute()

# or only sends the job...
ds_mean_march.persist()

# ...and later waits for the job to finish its execution and downloads the result
ds_result = ds_mean_march.load()

I think that such an Xarray interface could be built on top of this client library (perhaps in another repository). To make things easier, ideally openeo.DataCube would need to implement some duck array API.
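To make the idea of deferring work more concrete, here is a minimal sketch of a lazy wrapper that only records steps instead of executing them. Everything in it is an assumption for illustration: `LazyCube` is a hypothetical class that exists in neither openeo nor Xarray, the recorded steps are a simplified linear list rather than a valid openEO process graph, and the reducer is reduced to a plain string.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LazyCube:
    """Hypothetical Xarray-like wrapper that records steps and runs nothing."""

    steps: tuple = ()

    @classmethod
    def open(cls, collection, temporal_extent, bands):
        # First step: which collection, time range and bands to load.
        node = {
            "process_id": "load_collection",
            "arguments": {
                "id": collection,
                "temporal_extent": list(temporal_extent),
                "bands": list(bands),
            },
        }
        return cls((node,))

    def sel(self, time):
        # Translate a time slice into a (simplified) filter_temporal step.
        node = {
            "process_id": "filter_temporal",
            "arguments": {"extent": [time.start, time.stop]},
        }
        return LazyCube(self.steps + (node,))

    def mean(self, dim):
        # Translate a reduction into a (simplified) reduce_dimension step.
        node = {
            "process_id": "reduce_dimension",
            "arguments": {"dimension": dim, "reducer": "mean"},
        }
        return LazyCube(self.steps + (node,))

    def process_ids(self):
        return [step["process_id"] for step in self.steps]


cube = LazyCube.open("COPERNICUS/S1_GRD", ["2017-03-01", "2017-06-01"], ["VV"])
mean_march = cube.sel(time=slice("2017-03-01", "2017-04-01")).mean("time")
print(mean_march.process_ids())
# ['load_collection', 'filter_temporal', 'reduce_dimension']
```

Each call returns a new object and leaves the original untouched, which is roughly the behaviour users expect from Xarray; a real implementation would hand the recorded steps to the client library only when `compute()` is called.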

The main advantage is that users can interact with OpenEO using an API they are already familiar with (assuming they already know about Xarray). They can also further process the data (results) locally using the same interface.

This is a very rough idea that I just wanted to share here, though. I'm pretty sure that providing an Xarray interface would represent quite some work with lots of challenging issues (and likely things to address in Xarray). I'd be happy to read what you think about this idea! (Sorry, I'm not sure if it is the right place here for discussing this)

m-mohr commented 1 year ago

On the other hand, if the Array API specification gets adopted in the Python world, that might be the more general choice: https://data-apis.org/array-api/latest/API_specification/index.html

jdries commented 1 year ago

Interesting idea! In fact, we already try to do something similar to offering this array API with our 'band math' functionality. But there the user has to mix some openEO-specific things with other operations. It would certainly help our users (and us) if we could simply implement a well-documented or even standardized API.
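For readers unfamiliar with the pattern, band math of this kind is typically built on operator overloading: arithmetic on band objects builds an expression tree rather than computing values. The sketch below is only an illustration of that technique and not the client's actual implementation; `BandExpr`, its `band` placeholder node and the band names `B08`/`B04` are assumptions.

```python
class BandExpr:
    """Hypothetical expression node built by overloaded operators."""

    def __init__(self, process_id, args):
        self.process_id = process_id
        self.args = args

    @classmethod
    def band(cls, name):
        # Leaf node: a placeholder that refers to a band by name.
        return cls("band", (name,))

    def _binary(self, process_id, other):
        return BandExpr(process_id, (self, other))

    def __add__(self, other):
        return self._binary("add", other)

    def __sub__(self, other):
        return self._binary("subtract", other)

    def __truediv__(self, other):
        return self._binary("divide", other)

    def render(self):
        # Print the tree as nested calls, for inspection only.
        if self.process_id == "band":
            return self.args[0]
        parts = [
            arg.render() if isinstance(arg, BandExpr) else repr(arg)
            for arg in self.args
        ]
        return f"{self.process_id}({', '.join(parts)})"


nir, red = BandExpr.band("B08"), BandExpr.band("B04")
ndvi = (nir - red) / (nir + red)
print(ndvi.render())  # divide(subtract(B08, B04), add(B08, B04))
```

Nothing is evaluated when `ndvi` is defined; the tree could later be translated into backend processes, which is why a standardized array API would be a natural fit for this style.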

benbovy commented 1 year ago

> On the other hand, if the Array API specification gets adopted in the Python world, that might be the more general choice

Yes, that would make things even easier! If OpenEO datacubes implement (a part of) the Array API Standard, then many things should already work seamlessly through Xarray (see this discussion about testing the integration of any array class with Xarray: https://github.com/pydata/xarray/issues/6894).
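As a rough illustration of what implementing part of the standard involves: an array object exposes `__array_namespace__`, and generic code uses the returned namespace instead of importing a specific library. The sketch below assumes a toy `ToyCube` class holding plain Python floats and supports only a 1-D `mean`; both the class and the namespace are hypothetical and far from a conforming implementation.

```python
import types


class ToyCube:
    """Hypothetical array-like object that stores plain Python floats."""

    def __init__(self, values):
        self.values = list(values)

    def __array_namespace__(self, *, api_version=None):
        # Entry point defined by the Array API standard: return the
        # namespace that holds the array functions for this object.
        return toy_namespace


def _mean(x, /, *, axis=None, keepdims=False):
    # Only the 1-D, axis=None case is handled in this sketch.
    return ToyCube([sum(x.values) / len(x.values)])


toy_namespace = types.SimpleNamespace(mean=_mean)


def generic_mean(x):
    """Library-agnostic code: discover the namespace from the array itself."""
    xp = x.__array_namespace__()
    return xp.mean(x)


print(generic_mean(ToyCube([1.0, 2.0, 3.0])).values)  # [2.0]
```

A datacube class could return a lazy result from `mean` instead of computing it, so the same generic code would work whether the data lives locally or on a backend.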

For OpenEO datacube integration with Xarray, there are a few more things we could consider beyond the Array API:

soxofaan commented 1 year ago

Interesting idea indeed. It could be valuable to align with some kind of standardized array/cube API.

clausmichele commented 1 year ago

@benbovy maybe you would be interested in the client-side processing activity we are working on. The main idea is to allow the user to process data with openEO processes locally and not only in the cloud, using the Xarray implementations of the openEO processes. This is the PR: https://github.com/Open-EO/openeo-python-client/pull/338. There is also a draft notebook showing some of the implemented functionality, which you can look at in rendered form here: https://gistcdn.githack.com/clausmichele/9e2cf9589f6392262bc8626bb7e12a32/raw/b487860a4e8cdb2e7dac6402796eb8fbdefca2ce/client_side_proc_sample.html