An Pangeo/ODC-based backend that runs "out of the box"

m-mohr commented 2 years ago

(This is a discussion, not a vote)

We had quite some interest from the regional ODC initiatives (Swedish, Brazil, Swiss) but we couldn't really offer an easy solution. We then started documenting the steps at https://openeo.org/documentation/1.0/developers/backends/opendatacube.html but it's still combining a lot of different things and it seems not so easy overall. Therefore, the idea came up again today to have someone work on an ODC or Pangeo tech stack-based implementation that you can set up more easily. It might just be that it needs some work to better connect the different repos, but I'm not working with it so I don't know the exact obstacles. I already heard that there could be funding opportunities.

But first of all, I think we should bring together all people working on related projects and check their status. I'm specifically thinking about discussing at least with:

@clausmichele (EURAC) as contributor to openeo-odc and related repos
@ValentinaHutter (or someone else) from EODC for their implementation
@jonas-eberle (DLR) who seems to have an openEO implementation around the same tech stack
@danielFlemstrom (RISE) who is developing also an implementation around ODC

Who would be interested in this?

cc @Open-EO/psc

clausmichele commented 2 years ago

Good point! I would be interested in this! I think that currently the python xArray/Dask processing part is evolving nicely, but there is still a lack of a standardization in the back-end serving the openEO REST API. For instance, at Eurac we use the java based spring driver and I suppose at DLR as well (Prateek, one of the original developers, now work there). EODC is using a Python back-end and I do not know about RISE.

aljacob commented 2 years ago

I think this is definitely interesting and we should come up with some consolidated environment for quick deployment for this. This would definitely foster further uptake of openEO on the back-end provider side as well.

jonas-eberle commented 2 years ago

+1 for making it simpler and to consolidate the environments. I would also like to hear the pro's and con's about Java-based spring driver from EURAC and Python-based driver from EODC. We have also some experiences with using STAC API as a central data catalog used within xarray (Open Data Cube, stackstac).

ValentinaHutter commented 2 years ago

Sounds interesting, would be happy to join a call on it, if you plan to have one.

christophreimer commented 2 years ago

@m-mohr please add me to the discussion/meeting as well. I think we have to discuss the various components and/or software packages involved in an OpenDataCube / Pangeo backend for openEO

m-mohr commented 2 years ago

Yes, will do. Will send around a Doodle soon. Please note that this will not happen before April, maybe even only in May.

jonas-eberle commented 2 years ago

@m-mohr Hi Matthias, any update on this topic? Please also include @johntruckenbrodt - a new colleague at DLR who will work on xarray/Dask.

m-mohr commented 2 years ago

@jonas-eberle No, not yet. We need to find funding for this and then also find someone who has time to work on it. If anyone can contribute, please let us know.

jonas-eberle commented 2 years ago

Maybe it would be good to just start the discussion in a meeting - collecting some needs, requirements, ideas. This will also help us to estimate the amount of work.

christophreimer commented 2 years ago

Hi @jonas-eberle, we at EODC do have an Pangeo/ODC based backend around already. In addition we plan to release an "out-of-the-box" mini Pangeo/ODC backend by end July. My colleague @LukeWeidenwalker is currently working on that. Support for ODC is not foreseen to be included in this mini backend. If you have any questions, ideas or requirements, we are happy to discuss those with you.

TomAugspurger commented 2 years ago

Apologies for naive question, but can someone explain what we mean by "pangeo / ODC" stack here? Specifically, I'm wondering if pangeo = "JupyterHub + Dask on Kubernetes" in this context, or whether it's essentially just "xarray (plus rioxarray, etc.)".

christophreimer commented 2 years ago

@TomAugspurger we foresee to release a ready-to-use environment containing

openEO processes implemented with xarray and
the corresponding Dask packages packaged in Docker to run locally.

This is one of our building blocks we use at EODC to realize an openEO backend. The EODC openEO backend makes use of the same openEO processes implemented in xarray, but allows for execution of those via Dask on k8s.

m-mohr commented 2 years ago

Not directly Pangeo/ODC/xarray related, but here's also documentation about VITO's back-end as local docker: https://github.com/Open-EO/openeo-geotrellis-kubernetes/blob/master/examples/run_local.md

rabernat commented 2 years ago

Nice to see this effort happening!

Pangeo provides docker images which it may be useful to build on here: https://github.com/pangeo-data/pangeo-docker-images

benbovy commented 2 years ago

I just want to mention https://github.com/xarray-contrib/xpublish here, which might be useful for implementing a backend based on Xarray and Dask. It is a lightweight tool built on top of FastAPI for developing flexible REST APIs reusing Xarray datasets. Not sure it's 100% relevant, though, as I don't know much about the Open-EO backend architecture.

Also, not really related to this topic but I suggested the idea of an Open-EO Xarray front-end in https://github.com/Open-EO/openeo-python-client/issues/334. So that one could use Xarray for interacting with an Xarray-powered Open-EO backend 🙂

m-mohr commented 2 years ago

Thanks for the ideas. We'll look into them.

@SerRichard @LukeWeidenwalker something for you, right?

christophreimer commented 2 years ago

@benbovy thx for pointing us to https://github.com/xarray-contrib/xpublish. The openEO use case is a different to the one covered by xpublish. openEO is more about applying various processes (functions) on data directly in a given backend.

An openEO xarray frontend, or client, is not meaningful in my opinion, since openEO makes use of the concept of process graphs, which are an abstract representation of a workflow (chain of functions) not known by xarray. However, the openEO backend of EODC is based on xarray/dask as described above. We will offer the option to directly request compute resources via dask to execute any dask compliant computation on this backend with access to the provided data.

benbovy commented 2 years ago

Thanks for your feedback @christophreimer.

I think that Xpublish's features could be better advertised. Although initially it has been developed to "publish" Xarray datasets and serve data via a Zarr-compatible REST API, it can now be used to facilitate the development of any REST API that is internally consuming one or more Xarray datasets. For example, you could write something like this for the backend application:

import xarray as xr
from fastapi import APIRouter, Depends
from xpublish.dependencies import get_dataset

# note: the `get_dataset` FastAPI dependency takes a dataset name or id
# (str) as input and returns an actual xarray.Dataset object

openeo_router = APIRouter()

@openeo_router.get("/process")
def process_dataset(
    dataset: xr.Dataset = Depends(get_dataset),
    graph: ProcessGraph,
) -> ProcessResult:
    result = execute_graph(dataset, graph)
    return result

# data collection as a dict of xarray.Dataset objects
ds_collection = {
    "S1_GRD": xr.open_dataset("some_zarr_or_netcdf_dataset"),
}

rest = xpublish.Rest(
    ds_collection,
    routers=[openeo_router],
)

rest.serve("127.0.0.1:8000")

A client app might then send a request like "https://127.0.0.1:8000/S1_GRD/process?graph=..."

Note: I've only scratched the surface of OpenEO, so I assume it is a very naive, simplistic, incomplete, but hopefully meaningful example 🙂 .

Xpublish is a very lightweight wrapper around Xarray, FastAPI (and Cachey), though. To be fair, writing an Xarray-based OpenEO backend without relying on Xpublish would probably not represent much more effort.

benbovy commented 2 years ago

An openEO xarray frontend, or client, is not meaningful in my opinion

On this point I respectfully disagree, although I might be missing some important aspects of OpenEO.

OpenEO datacubes and Dask arrays share this aspect in common that they both hold a graph representing a computation workflow, which may grow by calling appropriate methods and operators on them, and which we can send to a backend (openEO) or a scheduler (Dask) and get the results.

Xarray dataarrays may wrap / handle Dask arrays as well as other kinds of "duck" arrays without knowing anything about what they actually represent. Why not OpenEO datacubes then, provided that these implement the API needed to behave like duck arrays? Xarray could also make its .compute() method more abstract, e.g., so that it could be reused for sending OpenEO jobs and retrieve the results.

One caveat is that OpenEO datacubes and Xarray dataarrays are both label-aware, unlike the array types that Xarray usually wrap. Not sure yet how exactly we can make the two work together nicely.

christophreimer commented 2 years ago

OpenEO datacubes and Dask arrays share this aspect in common that they both hold a graph representing a computation workflow, which may grow by calling appropriate methods and operators on them, and which we can send to a backend (openEO) or a scheduler (Dask) and get the results.

Yes fully support this argument. Both share the idea of representing a computation workflow as some kind of graph. Differences I do see is how they graph is represented. openEO process graph is represented as JSON, human-readable, support by almost all programming languages, covering a specific set of data types. Dask task graphs are serialized mainly via pickle/cloudpickle in binary format, Python specific, support for extended set of data types in Python (class, methods, functions) compared to JSON.

Xarray dataarrays may wrap / handle Dask arrays as well as other kinds of "duck" arrays without knowing anything about what they actually represent. Why not OpenEO datacubes then, provided that these implement the API needed to behave like duck arrays? Xarray could also make its .compute() method more abstract, e.g., so that it could be reused for sending OpenEO jobs and retrieve the results.

Not sure if I get your point on that topic. I will try to explain my point of view. OpenEO datacubes are only describing a theoretical concept of datacubes. An openEO backend implements exactly this theoretical concept utilizing technologies such as xarray, dask, spark etc.. In order to do so, each defined openEO process needs to be mapped to a corresponding implementation in the given technology.

benbovy commented 2 years ago

OpenEO datacubes are only describing a theoretical concept of datacubes

This is indeed what I understand, and this is similar to other libraries that expose "virtual" arrays or array wrappers... including Dask and Xarray! For example, for a Dask array each task on each chunk needs to be mapped to a corresponding implementation in Numpy (or Sparse or Cupy, etc.). An Xarray dataarray may wrap a Dask array that wrap Numpy arrays, etc. This can even get more complicated with multiple nested array wrappers, like shown in https://github.com/pydata/xarray/issues/5648, so for Xarray the actual computation may happen at a deep level. OpenEO datacubes are also lazy (i.e., no actual computation happens until the client sends the graph of processes to the backend), like Dask arrays. Xarray has already 1st class support for Dask arrays so in theory it could be expanded to support other lazy arrays / cubes.

EDIT: Sorry I brought this Xarray client topic in https://github.com/Open-EO/PSC/issues/16#issuecomment-1277392256 but it is different from the issue here. Perhaps it'd be better to continue this discussion in https://github.com/Open-EO/openeo-python-client/issues/334.

Open-EO / PSC

An Pangeo/ODC-based backend that runs "out of the box" #16