Open-EO / openeo-api

The openEO API specification
http://api.openeo.org
Apache License 2.0

Multiple end nodes #427

Closed m-mohr closed 1 year ago

m-mohr commented 2 years ago

Yes, I think so. The API says "multiple end nodes are possible".

Indeed, it currently says:

> One of the nodes ... (the final one) MUST have the result flag set to true, ... Having such a node is important as multiple end nodes are possible, ...

There is some room for interpretation here. It doesn't explicitly say that a backend must fully evaluate non-result end nodes. Analogy: a C or Java program can have multiple functions, but only main is triggered when the program is executed and there is no guarantee that all code will be visited.

The obligation to execute all end nodes instead of just the `"result": true` node has quite a big impact.

I think it should be made more explicit in the API description how to handle a process graph with multiple end nodes. From the above it should be clear that I favor the "only execute the result node" model. I think it's a simpler model, more straightforward to implement in both clients and back-ends, and it allows cleaner client APIs for the end user.

Note that it is still possible to "emulate" multiple result nodes in this model: you just collect all your result nodes in a final `array_create` process that acts as the single result node. If necessary, it could be handy to define a dedicated process for this (e.g. `collect`) at the API level, or just provide a helper at the client level.
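Such an emulation could look like the following sketch (the collection id, node names, and output formats are placeholders, not taken from the thread):

```json
{
  "process_graph": {
    "load": {
      "process_id": "load_collection",
      "arguments": { "id": "EXAMPLE_COLLECTION", "spatial_extent": null, "temporal_extent": null }
    },
    "save_tiff": {
      "process_id": "save_result",
      "arguments": { "data": { "from_node": "load" }, "format": "GTiff" }
    },
    "save_nc": {
      "process_id": "save_result",
      "arguments": { "data": { "from_node": "load" }, "format": "netCDF" }
    },
    "collect": {
      "process_id": "array_create",
      "arguments": { "data": [ { "from_node": "save_tiff" }, { "from_node": "save_nc" } ] },
      "result": true
    }
  }
}
```

Here both `save_result` branches become dependencies of the single `collect` node, so a "result-node-only" back-end would still evaluate them.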

Originally posted by @soxofaan in https://github.com/Open-EO/openeo-processes/issues/279#issuecomment-961143786

m-mohr commented 2 years ago

Copied over from #279, posted by me:

Examples:

Store intermediate results in a batch job: *(image)*

Store result in multiple formats without having to re-execute the same job: *(image)*

(drop_dimension is just an example, that could be any other processing step)

You could also use debug instead of save_result for the "leaf"/end nodes.
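As a sketch, the second example (multiple formats, two genuine end nodes, only one carrying the result flag) could be expressed like this; node names and the collection id are made up:

```json
{
  "process_graph": {
    "load": {
      "process_id": "load_collection",
      "arguments": { "id": "EXAMPLE_COLLECTION", "spatial_extent": null, "temporal_extent": null }
    },
    "drop_t": {
      "process_id": "drop_dimension",
      "arguments": { "data": { "from_node": "load" }, "name": "t" }
    },
    "save_tiff": {
      "process_id": "save_result",
      "arguments": { "data": { "from_node": "drop_t" }, "format": "GTiff" }
    },
    "save_nc": {
      "process_id": "save_result",
      "arguments": { "data": { "from_node": "load" }, "format": "netCDF" },
      "result": true
    }
  }
}
```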


The API says "multiple end nodes are possible". The result flags were introduced for "callbacks" so that a "return value" can be detected; they don't really have a meaning at the top level when run as a batch job, for example. That's of course different if it's stored as a UDP: then it's top-level, but likely to be used as a callback again. And as we don't know the context beforehand, the top level also has this result flag, although it is not always strictly required there.
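For context, a sketch of the callback case the result flag was introduced for: marking the return value of a child process graph. The node names are illustrative:

```json
{
  "reduce": {
    "process_id": "reduce_dimension",
    "arguments": {
      "data": { "from_node": "load" },
      "dimension": "t",
      "reducer": {
        "process_graph": {
          "mean1": {
            "process_id": "mean",
            "arguments": { "data": { "from_parameter": "data" } },
            "result": true
          }
        }
      }
    }
  }
}
```

Inside the `reducer` callback, `"result": true` on `mean1` tells the back-end which node's output is the callback's return value.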

m-mohr commented 2 years ago

> There is some room for interpretation here. It doesn't explicitly say that a backend must fully evaluate non-result end nodes. Analogy: a C or Java program can have multiple functions, but only main is triggered when the program is executed and there is no guarantee that all code will be visited.

I don't buy this analogy. main is an entry point, comparable to providing a process graph in POST /results, where you know your code is in the process.process_graph property. And then everything inside is executed. In Java / C it's totally fine to do what we have in the examples above. All of the nodes would be executed, although something completely unrelated might be returned (except for optimizations based on conditions, for example).


> the process graph parsing [...] heavily depends on the assumption that there is only one final node to evaluate

That's a somewhat "optimistic" assumption when the docs say "multiple end nodes are possible". Did you assume that the users add end nodes without a purpose? 🤔

The JS PG parsing library does it this way.


> I think it should be made more explicit in the API description how to handle a process graph with multiple end nodes.

Actually, it is (although reading it again, I see room for improvement). The behavior (which I also described above for JS) is documented in the chapter "Data Processing > Execution":

> To process the process graph on the back-end you need to go through all nodes/processes in the list and set for each node to which node it passes data and from which it expects data. In another iteration the back-end can find all start nodes for processing by checking for zero dependencies.
>
> You can now start and execute the start nodes (in parallel, if possible). Results can be passed to the nodes that were identified beforehand. For each node that depends on multiple inputs you need to check whether all dependencies have already finished and only execute once the last dependency is ready.
>
> Please be aware that the result node (result set to true) is not necessarily the last node that is executed. The author of the process graph may choose to set a non-end node as the result node!
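The two-pass approach described above (derive dependencies, then execute nodes once their dependencies are satisfied) can be sketched in Python. This is only an illustration under the assumption that process graphs are plain dicts with `{"from_node": ...}` references, not any back-end's actual implementation:

```python
def find_dependencies(arguments):
    """Recursively collect node ids referenced via {"from_node": ...}."""
    deps = set()
    if isinstance(arguments, dict):
        if "from_node" in arguments:
            deps.add(arguments["from_node"])
        else:
            for value in arguments.values():
                deps |= find_dependencies(value)
    elif isinstance(arguments, list):
        for value in arguments:
            deps |= find_dependencies(value)
    return deps


def execution_order(process_graph):
    """Return node ids ordered so every node comes after its dependencies."""
    deps = {node_id: find_dependencies(node.get("arguments", {}))
            for node_id, node in process_graph.items()}
    order, done = [], set()
    while len(order) < len(process_graph):
        # start nodes: all (remaining) nodes with zero unmet dependencies;
        # each such batch could be executed in parallel
        ready = [node_id for node_id in process_graph
                 if node_id not in done and deps[node_id] <= done]
        if not ready:
            raise ValueError("cycle in process graph")
        order.extend(sorted(ready))
        done.update(ready)
    return order
```

Note that all end nodes appear in the order, matching the documented behavior that non-result end nodes are evaluated too.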


> Note that it is still possible to "emulate" multiple result nodes in this model: you just collect all your result nodes in a final array_create process that acts as single result node. If necessary it could be handy to define a dedicated process for this (e.g. collect) at the API level or just provide a helper at the client level.

That sounds rather unintuitive to me.

jdries commented 2 years ago

I would say it's nice that the spec allows backends to implement this advanced case, if needed. In the meantime, we'll just wait for an actual user request before considering implementing this (in the geotrellis backend). As explained, we're currently mostly focusing on single-result-node graphs. When we consider cases like saving intermediate results, we'll also want to figure out how to properly name the resulting assets so end results can be distinguished from intermediate ones.

m-mohr commented 1 year ago

Here's a potential use case: https://discuss.eodc.eu/t/obtaining-multiple-variables/522