Open-EO / openeo-processes

Interoperable processes for openEO's big Earth observation cloud processing.
https://processes.openeo.org
Apache License 2.0
Process to run EO Application Packages (CWL) #507

Open clausmichele opened 1 month ago

clausmichele commented 1 month ago

run_ogc_application_package

Context

For the InterTwin project (and soon others), we would like to run an OGC Application Package inside an openEO process graph. The documentation for the OGC Application Package is here: https://docs.ogc.org/bp/20-089r1.html. We see it as a process similar to run_udf.

Summary

Description

Parameters

data

Optional: yes

Description

The data to be passed to the OGC Application Package execution engine. Optional, since the input data could already be defined in the CWL file, in which case no other inputs are needed.

Data Type

Datacube

cwl

Optional: no

Description

Currently it's a YAML file. Either we pass it as plain text/string, as for UDFs, or we pass a URL to it and the back-end loads it. The schema could be the same as for the udf parameter of run_udf, with some changes.

Data Type

string

cwl_params

Optional: no

Description

It's either a YAML or JSON file. Again, it could be passed in the same ways as described for the cwl parameter.

Data Type

string

Return Value

Description

The result should be made available as a STAC object, i.e. a JSON string. This way, the back-end can continue the process graph using load_stac.

Data Type

string
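As a minimal sketch of how a call to the proposed process might look in a process graph, the snippet below builds one node as a Python dictionary and prints it as JSON. The process and parameter names come from the proposal above and are not part of any released specification; the URL and the `threshold` parameter are placeholders, and `build_ap_node` is a hypothetical helper.

```python
import json


def build_ap_node(cwl, cwl_params, data=None):
    """Build a process-graph node for the proposed (not yet specified) process."""
    arguments = {"cwl": cwl, "cwl_params": cwl_params}
    if data is not None:
        # `data` is optional: the CWL may already define its own inputs.
        # {"from_node": ...} is how openEO process graphs reference another node.
        arguments["data"] = {"from_node": data}
    return {"process_id": "run_ogc_application_package", "arguments": arguments}


node = build_ap_node(
    cwl="https://example.com/workflow.cwl",     # placeholder URL
    cwl_params=json.dumps({"threshold": 0.5}),  # hypothetical CWL input
)
print(json.dumps(node, indent=2))
```

Passing `data="load1"` would add a reference to a hypothetical upstream node named `load1`.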

Links to additional resources (optional)

Examples

Currently in development: interTwin-eu/HyDroForM: Hydrological Drought Forecasting Model with HydroMT and Wflow (github.com)

cwltool --outdir ./wflow-output --no-read-only --no-match-user wflow-exp-run.cwl#run-wflow params-exp-wflow.yaml

OR something like this:

(very experimental, uses sapporo service: sapporo-wes/sapporo-service: A standard implementation conforming to the Global Alliance for Genomics and Health (GA4GH) Workflow Execution Service (WES) API specification. (github.com) )

curl -X POST http://localhost:1122/runs \
     -H "Content-Type: multipart/form-data" \
     -F "workflow_params=https://raw.githubusercontent.com/interTwin-eu/HyDroForM/experimental/experimental/hydromt/cwl/params.json;type=application/json" \
     -F "workflow_type=CWL" \
     -F "workflow_type_version=v1.2" \
     -F "workflow_engine=cwltool" \
     -F "workflow_url=https://raw.githubusercontent.com/interTwin-eu/HyDroForM/experimental/experimental/hydromt/cwl/hydromt-build.cwl" \
     -F "workflow_attachment=https://raw.githubusercontent.com/interTwin-eu/HyDroForM/experimental/experimental/hydromt/cwl/hydromt-build.cwl;type=application/octet-stream" \
     -F "workflow_attachment=https://raw.githubusercontent.com/interTwin-eu/HyDroForM/experimental/experimental/hydromt/cwl/update-config.cwl;type=application/octet-stream"

I'm cc'ing the people from Eurac working on this: @jzvolensky @iacopoff @aljacob

I am aware that VITO is also interested (@jdries @soxofaan), as is EODC (@christophreimer).

jdries commented 1 month ago

+1, we will probably start implementation work on this still in 2024 (I hope). For cwl_params, I'm wondering if we can find a solution that makes it look more like how other openEO processes specify parameters. One idea could be that we simply interpret all extra process arguments as CWL parameters.

The other difficult thing is how data goes in and out. STAC is for sure the solution, but it needs constraints to be usable. I'm also wondering whether we can avoid constructions where process graphs have to be very explicit about converting a datacube to STAC, running the AP, and then reading back from STAC, or whether we can have (a variant of?) run_ogc_application_package that simply works for raster cube input/output.

m-mohr commented 1 month ago

That sounds pretty reasonable. The return value should probably be a data cube (or the new stac subtype, see #485).

Here's a reference to an old PR, which had similar aims and has some discussion already: https://github.com/Open-EO/openeo-processes/pull/332

> One idea could be that we simply interpret all extra process arguments as cwl parameters.

That's not a thing in openEO, primarily because not all programming languages have a construct such as kwargs in Python.
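To make the two options concrete, here is a minimal Python sketch. Both function names and the parameters `threshold` and `region` are hypothetical. Variant A relies on Python's `**kwargs` to absorb undeclared arguments, which is the construct not every client language offers; variant B carries the CWL parameters in one explicitly declared object.

```python
def run_ap_with_extras(cwl, data=None, **cwl_params):
    """Variant A: every undeclared keyword argument becomes a CWL parameter."""
    return {"cwl": cwl, "data": data, "cwl_params": cwl_params}


def run_ap_explicit(cwl, cwl_params, data=None):
    """Variant B: CWL parameters travel in one explicit, declared object."""
    return {"cwl": cwl, "data": data, "cwl_params": dict(cwl_params)}


a = run_ap_with_extras("workflow.cwl", threshold=0.5, region="alps")
b = run_ap_explicit("workflow.cwl", {"threshold": 0.5, "region": "alps"})

# Both variants carry the same information; only the calling convention differs.
assert a == b
```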

m-mohr commented 1 week ago

I was just wondering whether CWL could just be another UDF runtime and whether we could use run_udf? @clausmichele

clausmichele commented 1 week ago

Maybe @jzvolensky can help, he's our OGC AP expert. I guess in this case we can't pass a single code block which contains everything, definition and input parameters to run an AP?

jzvolensky commented 1 week ago

@clausmichele I am not sure how that would work with the ADES. Since the CWL processes are stored in the ADES, I suppose they could be read in a UDF; you would then provide the input parameters in the UDF and send the processing request to the ADES? Maybe this is something we can look at/think about.

m-mohr commented 6 days ago

What is ADES in our context here?

I assumed that you'd specify a CWL file and that no prior interaction would be needed to store the CWL.

jzvolensky commented 6 days ago

Sorry, ADES is the Application Deployment and Execution Service from the EOEPCA project. Basically, it is a CWL execution engine which also supports managing CWLs (deploy, undeploy etc.). Our idea is to plug this into openEO so that we can execute Application Packages through a process, or possibly a UDF. This way we can have a set of predefined processes available to the user, or possibly allow users to provide their own.

m-mohr commented 6 days ago

The specification should be independent of the implementation. So ADES might be a data point, but we should probably focus on the underlying specification (i.e. OGC API - Processes - Part 2/3). Plugging that in makes sense, but in the end CWL could also be just a specific "language" to express UDFs in, similar to Python or R.

jzvolensky commented 3 days ago

> I was just wondering whether CWL could just be another UDF runtime and whether we could use run_udf? @clausmichele

Hello, I looked at the run_udf process spec and I guess this could work. Just to understand it correctly: you would, for example, call run_udf and pass the CWL (file, URL, whatever) as well as the inputs (YAML or JSON) for the CWL, with the runtime set to CWL 1.2, and then the runtime would do whatever it needs to do in the back-end to execute it and return the result?
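A rough sketch of what such a runtime could do on the back-end side, under several assumptions: the CWL arrives as inline text (a URL would need downloading first), the inputs arrive as a dictionary, and execution is delegated to the cwltool command line shown earlier in this thread. `prepare_cwl_run` is a hypothetical helper that only writes the files and builds the command; it does not execute anything.

```python
import json
import tempfile
from pathlib import Path


def prepare_cwl_run(udf, context, workdir):
    """Write the CWL and its inputs to disk and return a cwltool command line."""
    workdir = Path(workdir)
    cwl_path = workdir / "workflow.cwl"
    params_path = workdir / "params.json"  # cwltool accepts JSON or YAML job files
    cwl_path.write_text(udf)
    params_path.write_text(json.dumps(context))
    outdir = workdir / "output"
    # The command is only assembled here; a back-end would run it in a sandbox.
    return ["cwltool", "--outdir", str(outdir), str(cwl_path), str(params_path)]


with tempfile.TemporaryDirectory() as tmp:
    cmd = prepare_cwl_run("cwlVersion: v1.2\n", {"cwl_param1": True}, tmp)
    print(cmd)
```

How the files in the output directory are turned back into an openEO result (for example via STAC) is left open here, as it is in the discussion.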

m-mohr commented 2 days ago

Yeah. If we are reusing run_udf instead of a new process, it could look as follows in a process graph:

{
  process_id: "run_udf",
  arguments: {
    udf: "... CWL as YAML or URL or string ...",
    runtime: "cwl",
    version: "1.2", // could be omitted as it's the default version, see below
    context: {
      cwl_param1: true,
      cwl_param2: 99
    }
  }
}

While GET /udf_runtimes lists:

{
  title: "EO Application Packages (CWL)",
  type: "language",
  default: "1.2",
  versions: {
    "1.2": {
      libraries: { ... } // not sure about this entry. I guess it could be pre-loaded docker images or so?
    }
  }
}
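The interplay between the two snippets can be sketched in Python as follows. This assumes the runtime is listed under the key `cwl` in the GET /udf_runtimes response, mirroring the `runtime` argument above, and uses an empty `libraries` object as a placeholder since its content is still undecided. `resolve_runtime` is a hypothetical helper that falls back to the default version when the node omits `version`.

```python
import json

# Assumed shape of the GET /udf_runtimes response, keyed by runtime name.
UDF_RUNTIMES = {
    "cwl": {
        "title": "EO Application Packages (CWL)",
        "type": "language",
        "default": "1.2",
        "versions": {"1.2": {"libraries": {}}},  # placeholder, content undecided
    }
}


def resolve_runtime(node, runtimes):
    """Return (runtime, version) for a run_udf node, applying the default version."""
    args = node["arguments"]
    runtime = runtimes.get(args["runtime"])
    if runtime is None:
        raise ValueError(f"unknown UDF runtime: {args['runtime']}")
    version = args.get("version") or runtime["default"]
    if version not in runtime["versions"]:
        raise ValueError(f"unsupported version: {version}")
    return args["runtime"], version


node = {
    "process_id": "run_udf",
    "arguments": {
        "udf": "https://example.com/workflow.cwl",  # placeholder URL
        "runtime": "cwl",  # "version" omitted on purpose
        "context": {"cwl_param1": True, "cwl_param2": 99},
    },
}
print(json.dumps(node, indent=2))
print(resolve_runtime(node, UDF_RUNTIMES))
```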

It's just an idea that doesn't need an explicit process. If people think it would make sense to have a separate process, we can also discuss that. But right now I don't see an explicit reason why that might be better. Please let me know if you have any reasons in mind.

Somewhat related issue: #515

Also, run_udf is usually meant to be executed in datacube processes such as reduce_dimension. This would not be the case for EO Application Packages, I guess, which is somewhat against the best practice for UDFs. It's also unclear how a mapping between EO Application Packages and the openEO data types can be achieved and communicated to users.

Related process: run_udf_externally

jzvolensky commented 2 days ago

Okay, the first part looks really neat with defining the workflow and inputs.

In the second part, GET /udf_runtimes, do you mean just listing the available Docker images? Unless we extract them from the CWLs, this is not information that we or the user need to define: it is defined in the CWL, and I don't think storing it really adds any value.

The last paragraph is interesting. The Application Packages are fully standalone applications, right? From this point of view a new process makes sense, because the application and its execution are outside of the traditional process graph scope. All we do is bind it together with the rest of the openEO process chain using a process graph (although in theory we don't need any other process to use it, so it really can be a standalone process).

I do like the UDFs idea and if the UDF can support this with some minor best practices update or a general UDF use case extension then that is good, I suppose.