EOEPCA / system-architecture

EOEPCA System Architecture document
Apache License 2.0

Processing BBs: split resource orchestration from processing runners #7

Open jdries opened 5 days ago

jdries commented 5 days ago

Following up on our discussion yesterday, I'd like to propose a clarification to the description of the 'processing' concept. Right now, there's an important issue in the 'Processing Runner' explanation: it mixes resource orchestrators with various workflow libraries, and then goes on to claim that engines should be able to 'plug in' all of these.

Here's a proposal that's closer to reality:

Resource orchestrators: Kubernetes, Slurm, Apache YARN, Docker

These are components that manage a pool of virtual or physical hardware resources. Nowadays, and for the foreseeable future, all of these belong to the more specific category of container orchestrators, or at least have the capability to run containers. Most organizations support only one of these options, so being able to run on the most popular container orchestration platforms is important if we want to remain relevant.
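To illustrate that a container orchestrator sits below any specific processing technology, here is a minimal sketch (not EOEPCA code; the image name, namespace and command are placeholders) of submitting an arbitrary containerized processing step as a Kubernetes Job via the official Python client:

```python
# Sketch: run an arbitrary containerized step on Kubernetes.
# Assumes a configured kubeconfig; image/command are placeholders.
from kubernetes import client, config

config.load_kube_config()

job = client.V1Job(
    metadata=client.V1ObjectMeta(name="eo-step"),
    spec=client.V1JobSpec(
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                restart_policy="Never",
                containers=[client.V1Container(
                    name="step",
                    image="example.org/eo-processor:latest",  # placeholder
                    command=["process", "--input", "s3://bucket/scene"],
                )],
            )
        )
    ),
)

# The orchestrator neither knows nor cares what runs inside the container.
client.BatchV1Api().create_namespaced_job(namespace="default", body=job)
```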

Processing runners: Airflow, Dask, Spark, Argo, Knative

These are much more specific processing libraries and components, often deeply integrated into a particular technology stack or architectural paradigm.

All of these alternatives run on one or more of the orchestrators above, or can often be deployed in a 'static' setup without dynamic resource scheduling, although such a setup is less relevant for real-world EO processing.

Processing engines can use these components to implement openEO or CWL support, but an engine built around Spark will not somehow work with Dask or Airflow. It would also make sense to indicate that Dask- and Spark-style libraries are a good fit for the openEO model, while Argo and Airflow are closer to what CWL does. Indicating that in our architecture would make it a lot easier for the reader to understand the type of processing targeted by these two alternative APIs.
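To make the distinction concrete: a hedged, minimal sketch of the data-parallel style that suits the openEO model, using Dask (the cube here is random data standing in for a satellite image collection):

```python
# Sketch: openEO-style lazy, data-parallel band math on a data cube.
import dask.array as da

# Lazily build a (time, y, x) cube; placeholder for an image collection.
cube = da.random.random((10, 1024, 1024), chunks=(1, 512, 512))

# Reduce over the temporal dimension; nothing executes yet.
composite = cube.mean(axis=0)

# Trigger distributed execution on whatever Dask cluster is configured.
result = composite.compute()
```

By contrast, Argo and Airflow express a workflow as an explicit graph of discrete steps, much like a CWL workflow. A minimal Airflow sketch (task ids and commands are illustrative only, assuming Airflow 2.x):

```python
# Sketch: CWL-style explicit step graph expressed as an Airflow DAG.
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(dag_id="eo_workflow_sketch", start_date=datetime(2024, 1, 1),
         schedule=None, catchup=False) as dag:
    stage_in = BashOperator(task_id="stage_in", bash_command="echo stage-in")
    process = BashOperator(task_id="process", bash_command="echo process")
    stage_out = BashOperator(task_id="stage_out", bash_command="echo stage-out")

    # Explicit step dependencies, analogous to a CWL workflow's step graph.
    stage_in >> process >> stage_out
```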

rconway commented 5 days ago

As stated in the system architecture doc, the modular approach was defined to seek opportunities for common design and implementation across the Processing BB variants, i.e. reusable components (ref. Processing Concepts).

Most importantly we identified the Engine, which implements the standard API, and the Runner(s), which interface with the offering of the underlying platform. We want the Engine part to be reusable, such that it can be integrated with different platforms/backends via pluggable Runners. This is the approach that the zoo-dru API Processes implementation takes, which allows plugging in a Kubernetes runner (via Calrissian) and also an HPC/Slurm runner, without needing to modify the Engine part, which is universal regardless of backend.
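For readers unfamiliar with the pattern, a minimal sketch of the Engine/Runner split might look like the following (illustrative names only, not the actual zoo-dru code):

```python
# Sketch: the Engine implements the standard API and delegates execution
# to a pluggable, backend-specific Runner. All names are illustrative.
from abc import ABC, abstractmethod


class Runner(ABC):
    """Backend-specific execution of a deployed application package."""

    @abstractmethod
    def execute(self, application_package: str, inputs: dict) -> str:
        """Run the package and return a job identifier."""


class KubernetesRunner(Runner):
    def execute(self, application_package: str, inputs: dict) -> str:
        # e.g. submit a Calrissian pod that runs the CWL document
        return "k8s-job-123"


class SlurmRunner(Runner):
    def execute(self, application_package: str, inputs: dict) -> str:
        # e.g. sbatch a script that invokes a CWL runner on the HPC cluster
        return "slurm-job-456"


class ProcessingEngine:
    """Implements the standard API; unchanged regardless of backend."""

    def __init__(self, runner: Runner):
        self.runner = runner

    def execute_process(self, package: str, inputs: dict) -> str:
        return self.runner.execute(package, inputs)


# The same Engine code runs against either backend:
engine = ProcessingEngine(KubernetesRunner())
job_id = engine.execute_process("water-bodies.cwl", {"aoi": "..."})
```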

This made sense for API Processes - maybe the same is not true for openEO.

jdries commented 5 days ago

Your example of zoo-dru confirms my point: container orchestrators can run 'anything'. Software like Spark, Calrissian, Airflow and Dask focuses on specific types of processing architectures, and in a number of cases supports multiple container orchestrators.

So we need to add an extra layer to the diagram, and there is an opportunity to indicate which well-known technologies are (generally speaking) suitable for implementing specific APIs.

fabricebrito commented 5 days ago

FYI: for the API Processes processing engine, we've started documenting the processing runners (WIP); see https://eoepca.github.io/document-processing/design/processing-runner/kubernetes/