Open suecharo opened 4 months ago
Hey @suecharo thank you so much for contributing these ideas! Changes that have been made to support real world use cases are highly valuable as they tend to represent a real gap in the specification.
Ah yes, I can see we totally missed this in the implementation of #182, or at least combined the engine and the version into a single string. It also looks like we do not communicate the default_engine
in the service info.
In regards to your specific proposal the changes to the schema would constitute a breaking change and would necessitate a bump to WDL v2.0.0
. While I am generally a proponent for moving to a 2.0 (after a suitable amount of research on current gaps), I think it is likely 1+ years away.
I would modify what you have proposed to be non-breaking. Maybe something like:
{
"default_engine_parameters": [
{ "name": "--cpu", "value": "4", "engine": "{engine_id/verison}" }
]
}
This is logical to me (and something we also have implemented on our own WES engine). Although I would tweak a few of the recommendations
Sort Order: Add a sort_order query parameter to sort runs based on start_time in either ascending or descending order. The value should be "asc" or "desc", with the default being "desc".
WE have gone through various iterations of a field like this on our API, and have settled on a specification that looks like the following:
sort=(<PROPERTY>:)?<DIRECTION>(,(<PROPERTY>:)?<DIRECTION>)
This is more complex, definitely. IT filled the use case where we wanted to sort by multiple columns in multiple different directions. if ONLY the <DIRECTION>
was defined, we default to start_time
as the assumed property, otherwise we allow root keys in the RunLog
object. (we also extended to allow sorting by tags but that is way more complex).
Would you be open to a format like that? We could simplify for now to be:
sort=(<PROPERTY>:)?<DIRECTION>
<PROPERTY>
- optional. the sort key to use. currently only start_time
is officially required. If the property is omitted, it is assumed that the response should be sorted by the start_itme
in descending order<DIRCTION>
- required. The direction, one of asc
or desc
defaulting to desc
A few questions around this, especially given that the original workflow_attachments
was a concession and not ever desired to keep in the spec long term
I definitely understand where the need to download outputs came, (we encountered a similar problem)
Retrieve Outputs List: GET /runs/{run_id}/outputs should return the outputs (outputs field of the response from GET /runs/{run_id}) in JSON format.
This makes sense and I think is relatively easy to implement.
Download Outputs as ZIP: GET /runs/{run_id}/outputs?download=true should return the outputs as a ZIP file.
Is this output downloading the outputs.json
returned by the outputs endpoint or a zip containing the actual bytes of the outputs. If this this is a zip of the bytes of the outputs, I do not think it would be a good idea to include this in the spec in-general, at least not as a required endpoint (despite totally understanding the need). I have multiple concerns:
Download Each Output File: GET /runs/{run_id}/outputs/{path_to_file} should allow downloading a specific output file.
I have similar concerns as above, but this one may be more permissible, especially if it was serviced by a DRS API instead.
Hello, @patmagee.
Thank you for the suggestion on making this change non-breaking. I hadn't considered adding an engine
field.
However, the default_engine_parameters
object itself is defined here: https://github.com/ga4gh/workflow-execution-service-schemas/blob/3a832abeeafeffc1cb8d79045ba0d42b790773d9/openapi/workflow_execution_service.openapi.yaml#L867 , and this would also need to be updated to include an engine
field. (though I understand we could handle this with AdditionalProperties).
Anyway, I just wanted to point out that you might have forgotten to update default_engine_parameters
.
So it would be great if this could be reflected in v2.0.0
.
The format you proposed,
sort=(<PROPERTY>:)?<DIRECTION>(,(<PROPERTY>:)?<DIRECTION>)
looks great. An example would be something like:
GET /runs?sort=created_at:desc,updated_at:asc
In addition to sorting, I also implemented filtering. Here’s the definition I came up with:
- description: 'Filter the runs based on the state (e.g., "COMPLETE", "RUNNING", etc.).'
in: query
name: state
required: false
schema:
anyOf:
- $ref: '#/components/schemas/State'
- type: 'null'
It would be great if you could consider this as well.
What would be the use case vs just embedding the URL in the inputs and relying on the engine pulling down the file?
For workflows like CWL, where remote URLs are included in the workflow_parameter
as the location, and the workflow engine (e.g., cwltool) downloads the file, there is no need for using workflow_attachment
.
However, this behavior is entirely engine-dependent, and since Sapporo aims to support multiple engines, we frequently used workflow attachments to handle this. Futhermore, sending files as form data was often costly, so we implemented and proposed this kind of object for usability.
For workflows like CWL, where remote URLs can be included in the workflow_parameter
as the location, and the workflow engine (e.g., cwltool) downloads the file, there's no need for using workflow_attachment.
However, since this behavior is engine-dependent and Sapporo aims to support multiple engines, we often use workflow attachments to handle this.
Additionally, sending files as form data can be quite costly, so we implemented and proposed this object format for better usability.
What is the expectation around authorization/authentication with the files?
Currently, we only assume public URIs.
That said, we're exploring OpenID Connect (OIDC) with WES and S3 storage. In short, we're testing a scenario where an access token obtained from an OIDC provider (e.g., Keycloak) is used to authenticate WES requests via an Authorization Header, and the same token is used to access OIDC-integrated S3 storage (e.g., MinIO).
Even if we integrate with a DRS layer, it's challenging to handle everything through just a workflow_url. Having a mechanism to enumerate workflow attachments in an object format like this could still be beneficial.
Yes, I understand the difficulties in officially adopting this into the WES spec (but it does become necessary when implementing WES, doesn't it?).
Is this output downloading the outputs.json returned by the outputs endpoint or a zip containing the actual bytes of the outputs?
Yes, we are zipping the actual files under the outputs directory and sending them as a stream. I understand your concerns here as well. We're hoping that as the DRS layer matures, these issues will be addressed. For example, after execution, the outputs could be uploaded to S3 (DRS), and GET /runs/{run_id}/outputs
could return URLs pointing to the uploaded files.
We have completely rewritten Sapporo, a WES implementation, and released it as version 2.0.0. This new Sapporo is based on WES 1.1.0 and is implemented using Python's FastAPI. We first defined types based on WES 1.1.0, as seen in sapporo/schemas.py, and built the functionalities from there. Our goal was to avoid extending the original WES whenever possible. However, there were some necessary extensions that we implemented, which are summarized in this issue. We would be delighted if these extensions could be integrated into the original WES.
1. Fix
ServiceInfo
-default_workflow_engine_parameters
TypeList[DefaultWorkflowEngineParameter]
Dict[str, List[DefaultWorkflowEngineParameter]
Since WES 1.1.0 now supports multiple workflow engines within a single WES instance, this extension was necessary. The expected structure is as follows (example):
2. Add
sort_order
andstate
Query Parameters toGET /runs
To enhance the
GET /runs
endpoint, we propose introducing query parameters for sorting and filtering runs:sort_order
query parameter to sort runs based onstart_time
in either ascending or descending order. The value should be "asc" or "desc", with the default being "desc".state
query parameter to filter runs based on their state, such as "COMPLETE", "RUNNING", etc.3. Enhancing
POST /runs
to Support Download Workflow AttachmentsCurrently,
workflow_attachment
only supports attaching files via form-data. To improve its flexibility and functionality, we suggest the following enhancements:workflow_attachment_obj
that includes a download URL. This would allow WES to download the attachment file at runtime and stage the file. An example object structure would be:4. Enable Downloading of Run Outputs
We propose adding endpoints to download run outputs in various formats:
GET /runs/{run_id}/outputs
should return the outputs (outputs
field of the response fromGET /runs/{run_id}
) in JSON format.GET /runs/{run_id}/outputs?download=true
should return the outputs as a ZIP file.GET /runs/{run_id}/outputs/{path_to_file}
should allow downloading a specific output file.