ga4gh / workflow-execution-service-schemas

The WES API is a standard way to run and manage portable workflows.
Apache License 2.0

Propose Enhancements from Sapporo WES 2.0.0 #220

Open suecharo opened 4 months ago

suecharo commented 4 months ago

We have completely rewritten Sapporo, a WES implementation, and released it as version 2.0.0. This new Sapporo is based on WES 1.1.0 and is implemented using Python's FastAPI. We first defined types based on WES 1.1.0, as seen in sapporo/schemas.py, and built the functionalities from there. Our goal was to avoid extending the original WES whenever possible. However, there were some necessary extensions that we implemented, which are summarized in this issue. We would be delighted if these extensions could be integrated into the original WES.

1. Fix ServiceInfo - default_workflow_engine_parameters Type

Since WES 1.1.0 now supports multiple workflow engines within a single WES instance, this extension was necessary. The expected structure is as follows (example):

{
  "default_workflow_engine_parameters": {
    "cwltool": [
      { "name": "--cpu", "value": "2" },
      { "name": "--mem", "value": "4G" }
    ],
    "cromwell": [
      { "name": "--cpu", "value": "4" }
    ]
  }
}
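For illustration, the proposed per-engine mapping could be typed roughly as follows. This is a minimal stdlib-only Python sketch (Sapporo's actual pydantic models in sapporo/schemas.py may differ; names here are illustrative):

```python
from dataclasses import dataclass


@dataclass
class DefaultWorkflowEngineParameter:
    # Mirrors the WES DefaultWorkflowEngineParameter object ("type" omitted for brevity)
    name: str
    value: str


# Proposed shape: engine name -> that engine's default parameters
DefaultEngineParams = dict[str, list[DefaultWorkflowEngineParameter]]

default_workflow_engine_parameters: DefaultEngineParams = {
    "cwltool": [
        DefaultWorkflowEngineParameter(name="--cpu", value="2"),
        DefaultWorkflowEngineParameter(name="--mem", value="4G"),
    ],
    "cromwell": [DefaultWorkflowEngineParameter(name="--cpu", value="4")],
}
```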

2. Add sort_order and state Query Parameters to GET /runs

To enhance the GET /runs endpoint, we propose introducing query parameters for sorting and filtering runs:

  1. Sort Order: Add a sort_order query parameter to sort runs based on start_time in either ascending or descending order. The value should be "asc" or "desc", with the default being "desc".
  2. State Filter: Add a state query parameter to filter runs based on their state, such as "COMPLETE", "RUNNING", etc.
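As a rough sketch of the intended semantics (hypothetical in-memory records; a real WES would push this into its database query):

```python
from datetime import datetime

# Hypothetical run records; field names follow the RunLog/RunStatus objects
runs = [
    {"run_id": "a", "state": "COMPLETE", "start_time": datetime(2024, 5, 1)},
    {"run_id": "b", "state": "RUNNING", "start_time": datetime(2024, 5, 3)},
    {"run_id": "c", "state": "COMPLETE", "start_time": datetime(2024, 5, 2)},
]


def list_runs(runs, sort_order="desc", state=None):
    """Apply the proposed `state` filter and `sort_order` on start_time."""
    if sort_order not in ("asc", "desc"):
        raise ValueError("sort_order must be 'asc' or 'desc'")
    selected = [r for r in runs if state is None or r["state"] == state]
    return sorted(selected, key=lambda r: r["start_time"],
                  reverse=(sort_order == "desc"))
```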

3. Enhancing POST /runs to Support Downloading Workflow Attachments

Currently, workflow_attachment only supports attaching files via form-data. To improve its flexibility, we suggest additionally accepting a JSON array of objects, where each entry gives a destination path (file_name) and a URL (file_url) that the WES server downloads into the run's working directory:

[
  { "file_name": "path/to/file", "file_url": "https://example.com/path/to/file" }
]
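A minimal sketch of how a server might stage such attachments, assuming public URLs and no auth handling (the path-traversal guard is an implementation detail worth keeping in mind, not part of the proposal):

```python
import posixpath
import urllib.request
from pathlib import Path


def safe_destination(workdir: Path, file_name: str) -> Path:
    """Resolve file_name under workdir, rejecting absolute paths and '..' escapes."""
    normalized = posixpath.normpath(file_name)
    if normalized.startswith(("/", "..")):
        raise ValueError(f"unsafe attachment path: {file_name}")
    return workdir / normalized


def stage_attachments(workdir: Path, attachments: list[dict]) -> None:
    """Download each file_url into the run's working directory (public URLs assumed)."""
    for att in attachments:
        dest = safe_destination(workdir, att["file_name"])
        dest.parent.mkdir(parents=True, exist_ok=True)
        urllib.request.urlretrieve(att["file_url"], dest)  # no auth handling here
```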

4. Enable Downloading of Run Outputs

We propose adding endpoints to download run outputs in various formats:

  1. Retrieve Outputs List: GET /runs/{run_id}/outputs should return the outputs (outputs field of the response from GET /runs/{run_id}) in JSON format.
  2. Download Outputs as ZIP: GET /runs/{run_id}/outputs?download=true should return the outputs as a ZIP file.
  3. Download Each Output File: GET /runs/{run_id}/outputs/{path_to_file} should allow downloading a specific output file.
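For endpoints 1 and 2, a minimal sketch of the server-side mechanics (in-memory zipping shown for brevity; a production server would stream):

```python
import io
import zipfile
from pathlib import Path


def outputs_listing(outputs_dir: Path) -> list[dict]:
    """JSON-ready listing of every file under the run's outputs directory."""
    return [
        {"path": str(p.relative_to(outputs_dir)), "size": p.stat().st_size}
        for p in sorted(outputs_dir.rglob("*"))
        if p.is_file()
    ]


def outputs_zip(outputs_dir: Path) -> bytes:
    """Zip the outputs in memory; large runs would need chunked streaming instead."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for entry in outputs_listing(outputs_dir):
            zf.write(outputs_dir / entry["path"], arcname=entry["path"])
    return buf.getvalue()
```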
patmagee commented 3 months ago

Hey @suecharo thank you so much for contributing these ideas! Changes that have been made to support real world use cases are highly valuable as they tend to represent a real gap in the specification.

1. Fix ServiceInfo - default_workflow_engine_parameters Type

Ah yes, I can see we totally missed this in the implementation of #182, or at least combined the engine and the version into a single string. It also looks like we do not communicate the default_engine in the service info.

Regarding your specific proposal, the changes to the schema would constitute a breaking change and would necessitate a bump to WES v2.0.0. While I am generally a proponent of moving to a 2.0 (after a suitable amount of research on current gaps), I think it is likely 1+ years away.

I would modify what you have proposed to be non-breaking. Maybe something like:

{
  "default_engine_parameters": [
    { "name": "--cpu", "value": "4", "engine": "{engine_id/version}" }
  ]
} 

2. Add sort_order and state Query Parameters to GET /runs

This is logical to me (and something we have also implemented in our own WES engine), although I would tweak a few of the recommendations.

Sort Order: Add a sort_order query parameter to sort runs based on start_time in either ascending or descending order. The value should be "asc" or "desc", with the default being "desc".

We have gone through various iterations of a field like this on our API and have settled on a specification that looks like the following:

sort=(<PROPERTY>:)?<DIRECTION>(,(<PROPERTY>:)?<DIRECTION>)*

This is definitely more complex, but it fills the use case where we wanted to sort by multiple columns in different directions. If only the <DIRECTION> is defined, we default to start_time as the assumed property; otherwise we allow root keys of the RunLog object. (We also extended this to allow sorting by tags, but that is way more complex.)

Would you be open to a format like that? We could simplify for now to be:

sort=(<PROPERTY>:)?<DIRECTION>
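A hypothetical parser for this grammar, with start_time as the assumed default property per the description above (it also accepts the comma-separated multi-key form):

```python
import re

# One sort key: optional property name, then a required direction
_SORT_KEY = re.compile(r"^(?:(?P<prop>[A-Za-z_][A-Za-z0-9_]*):)?(?P<dir>asc|desc)$")


def parse_sort(value: str) -> list[tuple[str, str]]:
    """Parse e.g. 'created_at:desc,updated_at:asc' into (property, direction) pairs."""
    keys = []
    for part in value.split(","):
        m = _SORT_KEY.match(part.strip())
        if m is None:
            raise ValueError(f"invalid sort key: {part!r}")
        keys.append((m.group("prop") or "start_time", m.group("dir")))
    return keys
```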

3. Enhancing POST /runs to Support Downloading Workflow Attachments

A few questions around this, especially given that the original workflow_attachments field was a concession and was never intended to stay in the spec long term:

  1. What would be the use case vs just embedding the URL in the inputs and relying on the engine pulling down the file? Is there an assumption that the
  2. What is the expectation around authorization/authentication for the files? Does it only implicitly work for public URIs, or is that up to the specific engine?
  3. Is this something that a tighter-knit integration with DRS could solve?

4. Enable Downloading of Run Outputs

I definitely understand where the need to download outputs came from (we encountered a similar problem).

Retrieve Outputs List: GET /runs/{run_id}/outputs should return the outputs (outputs field of the response from GET /runs/{run_id}) in JSON format.

This makes sense and I think is relatively easy to implement.

Download Outputs as ZIP: GET /runs/{run_id}/outputs?download=true should return the outputs as a ZIP file.

Is this downloading the outputs.json returned by the outputs endpoint, or a zip containing the actual bytes of the outputs? If this is a zip of the bytes of the outputs, I do not think it would be a good idea to include this in the spec in general, at least not as a required endpoint (despite totally understanding the need). I have multiple concerns:

  1. Outputs of a workflow can often be tens to hundreds of GB. Zipping that amount of content is not trivial and cannot easily be done interactively or without some compute.
  2. Streaming large amounts of data has egress-cost implications if the WES API is hosted in the cloud.
  3. This feels a lot like it could be serviced by DRS, which fills a similar need.

Download Each Output File: GET /runs/{run_id}/outputs/{path_to_file} should allow downloading a specific output file.

I have similar concerns as above, but this one may be more permissible, especially if it was serviced by a DRS API instead.

suecharo commented 2 months ago

Hello, @patmagee.

1. Fix ServiceInfo - default_workflow_engine_parameters Type

Thank you for the suggestion on making this change non-breaking. I hadn't considered adding an engine field. However, the default_engine_parameters object itself is defined here: https://github.com/ga4gh/workflow-execution-service-schemas/blob/3a832abeeafeffc1cb8d79045ba0d42b790773d9/openapi/workflow_execution_service.openapi.yaml#L867 , and it would also need to be updated to include an engine field (though I understand we could handle this with additionalProperties).

Anyway, I just wanted to point out that you might have forgotten to update default_engine_parameters. So it would be great if this could be reflected in v2.0.0.

2. Add sort_order and state Query Parameters to GET /runs

The format you proposed,

sort=(<PROPERTY>:)?<DIRECTION>(,(<PROPERTY>:)?<DIRECTION>)

looks great. An example would be something like:

GET /runs?sort=created_at:desc,updated_at:asc

In addition to sorting, I also implemented filtering. Here’s the definition I came up with:

- description: 'Filter the runs based on the state (e.g., "COMPLETE", "RUNNING", etc.).'
  in: query
  name: state
  required: false
  schema:
    anyOf:
    - $ref: '#/components/schemas/State'
    - type: 'null'

It would be great if you could consider this as well.

3. Enhancing POST /runs to Support Downloading Workflow Attachments

What would be the use case vs just embedding the URL in the inputs and relying on the engine pulling down the file?

For workflows like CWL, where remote URLs can be included in the workflow_parameter as the location and the workflow engine (e.g., cwltool) downloads the file, there is no need for workflow_attachment. However, this behavior is entirely engine-dependent, and since Sapporo aims to support multiple engines, we frequently use workflow attachments to handle this. Furthermore, sending files as form data is often costly, so we implemented and proposed this kind of object format for better usability.

What is the expectation around authorization/authentication with the files?

Currently, we only assume public URIs.

That said, we're exploring OpenID Connect (OIDC) with WES and S3 storage. In short, we're testing a scenario where an access token obtained from an OIDC provider (e.g., Keycloak) is used to authenticate WES requests via an Authorization Header, and the same token is used to access OIDC-integrated S3 storage (e.g., MinIO).

Even if we integrate with a DRS layer, it's challenging to handle everything through just a workflow_url. Having a mechanism to enumerate workflow attachments in an object format like this could still be beneficial.

4. Enable Downloading of Run Outputs

Yes, I understand the difficulties in officially adopting this into the WES spec (but it does become necessary when implementing WES, doesn't it?).

Is this output downloading the outputs.json returned by the outputs endpoint or a zip containing the actual bytes of the outputs?

Yes, we are zipping the actual files under the outputs directory and sending them as a stream. I understand your concerns here as well. We're hoping that as the DRS layer matures, these issues will be addressed. For example, after execution, the outputs could be uploaded to S3 (DRS), and GET /runs/{run_id}/outputs could return URLs pointing to the uploaded files.
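As a sketch of that direction (the base URL is hypothetical; a real deployment would return DRS IDs or presigned URLs rather than plain prefixed paths):

```python
def outputs_as_urls(run_id: str, output_paths: list[str], base_url: str) -> list[dict]:
    """Rewrite each output path to a URL under an object-store/DRS base.

    base_url is a hypothetical prefix such as
    "https://storage.example.com/wes-outputs".
    """
    return [{"path": p, "url": f"{base_url}/{run_id}/{p}"} for p in output_paths]
```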