Open-EO / openeo-api

The openEO API specification
http://api.openeo.org
Apache License 2.0

Federation Extension #419

Closed m-mohr closed 2 years ago

m-mohr commented 3 years ago

A first draft of the federation extension. Everything is up for discussion; it's really just capturing ideas right now.

Rendered version: https://github.com/Open-EO/openeo-api/blob/federation-extension/extensions/federation/README.md

Related issue: https://github.com/openEOPlatform/architecture-docs/issues/188

soxofaan commented 2 years ago

Just FYI: the aggregator now includes this under / as an initial implementation (with versioned back-end URLs):

"federation": {
  "eodc": {
    "url": "https://openeo.eodc.eu/v1.0/"
  },
  "vito": {
    "url": "https://openeo.vito.be/openeo/1.0/"
  }
},
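
As a rough illustration of how a client might consume this field, here is a minimal Python sketch. It assumes the capabilities document looks exactly like the excerpt above; the helper name `list_federation_backends` is made up for this example and is not part of any client library.

```python
import json

# Excerpt of a capabilities document ("GET /"), reduced to the
# "federation" field quoted above.
capabilities_json = """
{
  "federation": {
    "eodc": {"url": "https://openeo.eodc.eu/v1.0/"},
    "vito": {"url": "https://openeo.vito.be/openeo/1.0/"}
  }
}
"""


def list_federation_backends(capabilities: dict) -> dict:
    """Map each back-end id to its URL (hypothetical helper).

    Returns an empty dict if the document has no "federation" field.
    """
    federation = capabilities.get("federation", {})
    return {backend_id: info["url"] for backend_id, info in federation.items()}


backends = list_federation_backends(json.loads(capabilities_json))
for backend_id, url in backends.items():
    print(f"{backend_id}: {url}")
```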

soxofaan commented 2 years ago

Another thing to consider: in the aggregator there is currently a priority order among the back-ends (VITO and EODC), which is used when there is a (metadata) "conflict" (e.g. title of a "merged" collection) or when there is no obvious back-end for a query (e.g. https://twitter.com/matthmohr/status/1451589229721661444). At the moment the VITO back-end has the highest priority.

It might be relevant to formalize this priority better here, e.g. with an explicit back-end order, or by defining a main/reference back-end.
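
To make the "explicit back-end order" idea concrete, here is a minimal Python sketch of priority-based conflict resolution. It is not the aggregator's actual code: the function name, the collection id, and the metadata values are invented for illustration, and only the "VITO first" order comes from the description above.

```python
# Explicit back-end order, highest priority first (assumption for this sketch).
BACKEND_PRIORITY = ["vito", "eodc"]


def merge_collection_metadata(per_backend: dict, priority: list) -> dict:
    """Merge the metadata of one collection as reported by several back-ends.

    On conflicting fields the value of the higher-priority back-end wins.
    """
    merged = {}
    # Apply the lowest priority first so that later (higher-priority)
    # back-ends overwrite conflicting fields.
    for backend_id in reversed(priority):
        merged.update(per_backend.get(backend_id, {}))
    return merged


# Made-up metadata for a collection offered by both back-ends.
per_backend = {
    "eodc": {"id": "SENTINEL2_L2A", "title": "Sentinel-2 L2A (EODC)", "license": "proprietary"},
    "vito": {"id": "SENTINEL2_L2A", "title": "Sentinel-2 Level 2A"},
}
merged = merge_collection_metadata(per_backend, BACKEND_PRIORITY)
print(merged["title"])
```

Fields that only one back-end reports (here `license`) survive the merge; only true conflicts are decided by the order.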

m-mohr commented 2 years ago

Long-term I think there should be no order. Process differences could partly be solved by checking against the official specification. All other cases would probably work best if they are not solved automatically but are instead logged somehow, so that we can check what the issue is and solve it. To some degree, we may also need to specify processes and collections "manually" if an automatic merge doesn't work well (e.g. different descriptions with warnings/notes per back-end). And jobs should be forwarded somewhat evenly if they can run on multiple back-ends.
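
A minimal Python sketch of what "log instead of resolve" and "forward somewhat evenly" could look like. Both helpers are hypothetical, the metadata is made up, and round-robin is just one simple way to spread jobs; nothing here is prescribed by the draft.

```python
import itertools
import logging

logger = logging.getLogger("federation")


def find_conflicts(per_backend: dict) -> dict:
    """Return {field: {backend_id: value}} for every field whose values differ.

    Conflicts are logged for manual follow-up rather than resolved.
    """
    fields = set()
    for metadata in per_backend.values():
        fields.update(metadata)
    conflicts = {}
    for field in sorted(fields):
        values = {b: m[field] for b, m in per_backend.items() if field in m}
        # Compare via repr() so unhashable values (lists, dicts) work too.
        if len({repr(v) for v in values.values()}) > 1:
            conflicts[field] = values
            logger.warning("Metadata conflict in %r: %s", field, values)
    return conflicts


def round_robin(backend_ids):
    """Cycle through the capable back-ends so jobs are spread evenly."""
    return itertools.cycle(sorted(backend_ids))


conflicts = find_conflicts({
    "vito": {"title": "A", "license": "x"},
    "eodc": {"title": "B", "license": "x"},
})
picker = round_robin(["vito", "eodc"])
assignments = [next(picker) for _ in range(4)]
print(conflicts, assignments)
```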

m-mohr commented 2 years ago

From the meeting today: It seems we also need a flag that the user can set for batch jobs/sync proc/web services so that they can choose explicitly which back-end to send a process graph to (e.g. similar to the billing plan option). Or should this be done via load_collection properties? The latter seems a bit weird.
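
As a sketch of the "flag" variant: the request body for creating a batch job could carry the back-end choice next to the existing `plan` field. The field name `backend` below is purely hypothetical (no name was agreed on in the meeting), as are the helper function and the collection id.

```python
import json
from typing import Optional


def build_job_request(process_graph: dict,
                      plan: Optional[str] = None,
                      backend: Optional[str] = None) -> dict:
    """Build a request body for creating a batch job (POST /jobs)."""
    body = {"process": {"process_graph": process_graph}}
    if plan is not None:
        # Existing core API field for selecting the billing plan.
        body["plan"] = plan
    if backend is not None:
        # HYPOTHETICAL field: explicit back-end selection, modelled on "plan".
        body["backend"] = backend
    return body


# Tiny process graph; the collection id is made up.
process_graph = {
    "load": {
        "process_id": "load_collection",
        "arguments": {"id": "SENTINEL2_L2A", "spatial_extent": None, "temporal_extent": None},
        "result": True,
    }
}
body = build_job_request(process_graph, plan="free", backend="vito")
print(json.dumps(body, indent=2))
```

Omitting `backend` leaves the choice to the federation, which keeps the request compatible with clients that do not know the flag.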

soxofaan commented 2 years ago

> Or should this be done via load_collection properties? The latter seems a bit weird.

Indeed, that feature is focused on specifying the back-end that provides a particular collection, which currently also determines the back-end that executes the whole graph.

> It seems we also need a flag that the user can set for batch jobs/sync proc/web services so that they can choose explicitly which back-end to send a process graph

Another idea that crossed my mind is having back-end specific connection urls in addition to the general one, e.g.

so you can interact with an explicitly chosen backend, but still use the general platform features (auth, billing, ...). Putting a back-end identifier in the base url avoids having to define some kind of selection field in multiple end-points. And you can easily provide collection/process discovery without all the tricky metadata merging.
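
A minimal Python sketch of the "back-end identifier in the base URL" idea. The host and the path layout are invented for illustration and are not the URLs proposed in this comment; the point is only that the selection lives in the connection URL, so no individual endpoint needs an extra field.

```python
from typing import Optional

# HYPOTHETICAL federation base URL, not a real service.
FEDERATION_BASE_URL = "https://openeo.example.org/"


def connection_url(backend_id: Optional[str] = None, api_version: str = "1.0") -> str:
    """Return the URL to connect to.

    Without a back-end id this is the general (federated) URL; with one,
    a back-end-specific URL under the same host (hypothetical layout).
    """
    base = FEDERATION_BASE_URL.rstrip("/")
    if backend_id is None:
        return f"{base}/v{api_version}/"
    return f"{base}/backends/{backend_id}/v{api_version}/"


print(connection_url())
print(connection_url("vito"))
```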

m-mohr commented 2 years ago

Another note from the meeting today: Do we need a way to expose for collections and processes for which execution type (batch, sync, web services) they are (not) available?

I'm not sure whether we need it here or in the core API.

soxofaan commented 2 years ago

> Do we need a way to expose for collections and processes for which execution type (batch, sync, web services) they are (not) available? I'm not sure we need it here or in the core API?

Sounds like something that is not (only) related to federation.

m-mohr commented 2 years ago

Added a new issue: https://github.com/Open-EO/openeo-api/issues/429

m-mohr commented 2 years ago

> Another idea that crossed my mind is having back-end specific connection urls in addition to the general one [...] so you can interact with an explicitly chosen backend, but still use the general platform features (auth, billing, ...). Putting a back-end identifier in the base url avoids having to define some kind of selection field in multiple end-points. And you can easily provide collection/process discovery without all the tricky metadata merging.

Yeah, I see pros and cons for both approaches. I think we need to discuss this in detail and check what is the better way forward.

Some that come to mind right now:

Back-end URLs:
- Pros: Avoids merging metadata (could be mitigated by federation extension and some hand-drafted metadata)
- Cons: You need to reconnect each time

Flag in requests:
- Pros: You need to connect once (think web editor) and then can select each time easily in the UI (but this is less easy in programming libraries)
- Cons: Needs changes in clients

soxofaan commented 2 years ago

> Cons: You need to reconnect each time

I don't think this will be a big issue in practice: the choice to work on a specific back-end instead of the federated one will be a very conscious decision, so it's a good thing to make that clear. Moreover, a lot of things are affected by switching between the federated back-end and the underlying back-ends: collections, processes, file formats, UDF runtimes, ... (as discussed elsewhere), so I think it's a good thing that this is reflected "UI-wise" by having to create a separate connection.

m-mohr commented 2 years ago

Indeed, this might be biased from a Web Editor standpoint, although as a user I'd personally still prefer to have a simple switch. This could be something to leave out for now and then ask our users what they would prefer.

m-mohr commented 2 years ago

Merged for now, let's discuss further issues in separate PRs.