ga4gh / workflow-execution-service-schemas

The WES API is a standard way to run and manage portable workflows.
Apache License 2.0

Input and Output Format Specification #176

Open patmagee opened 2 years ago

patmagee commented 2 years ago

As the WES specification gains traction and there are more adopters, it is becoming apparent that interoperability between WES implementations is a real challenge. Initiatives like FASP have highlighted the numerous challenges around interacting with a given WES API.

One of the key challenges is the format of workflow inputs and outputs. Presently, there is no common definition for these fields in the WES specification, meaning it is up to each engine / language to dictate the format in which inputs should be represented. Effectively, this requires any tooling built to interact with WES to either 1) make assumptions about the representation and only work with a specific language / engine, or 2) create custom logic for each implementation. This approach does not scale and does not build a strong community centred around an interoperable standard.

It is clear that we need to either reduce the divergence between the different engines, or provide a solution that mitigates it.

State of the WES Inputs and Outputs

Inputs

Workflow inputs are submitted as part of the workflow_params form parameter, and there is no restriction on what type of values they can represent. It is up to the user to know how to define these values for the given language they are using.

The situation is made a little more complex, however, by the fact that the RunRequest object implicitly expects inputs to be in a JSON coercible format.
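As a rough illustration of what "JSON coercible" means in practice, the sketch below builds the form fields of a run request in Python. The field names follow the RunRequest object; the URL and the WDL-style input key are made up for the example, and nothing is actually sent to a server.

```python
import json


def build_run_request(workflow_url, workflow_type, workflow_type_version, inputs):
    """Return the form fields of a WES run request as a dict of strings."""
    return {
        "workflow_url": workflow_url,
        "workflow_type": workflow_type,
        "workflow_type_version": workflow_type_version,
        # workflow_params is a form field, so the inputs travel as one JSON string.
        "workflow_params": json.dumps(inputs),
    }


form = build_run_request(
    "https://example.org/hello.wdl",  # hypothetical workflow location
    "WDL",
    "1.0",
    {"hello.name": "world"},  # hypothetical input; the key layout is engine-specific
)
```

The specification says nothing about what goes inside the `workflow_params` string, which is the gap this issue is about.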

Outputs

Workflow outputs are returned by the engine as arbitrary JSON in the RunLog, so outputs currently must be represented as JSON. Since there is no imposed structure other than JSON, each engine is free to implement the outputs section as it sees fit, leading to divergence between engines.

Proposals

I have outlined a number of proposals that immediately come to mind; however, I would encourage anyone to come up with their own proposal, or amend mine as well.

Proposal 1: Define Abstract input and output format

The WES API should prescribe the format of all inputs and outputs in a language agnostic way. This would require all implementations to support the exact same input/output format and internally translate to whatever format is necessary for their supported languages.

Defining a singular format is not without its challenges. Different languages have different needs, and their input formats largely reflect that. Capturing all of the nuances in a single meta abstraction would be a challenge to do without losing information in the process.

Given that most (but not all) workflow languages support some sort of rudimentary type system, one approach may be to define a superset of types common across all languages. This likely is not possible, and means that the WES API would not necessarily be able to represent all types, or new types. It also implies that there are specific languages that WES supports, and any other language is out of scope.

Another possibility is to not use types at all but simply require a specific format (i.e. JSON for the inputs). This is a step better than where we were before; however, without the ability to define HOW those JSON inputs are structured, the interoperability problem remains.

Pros

Cons

Proposal 2: WES API Should Be Self-Describing for a Common Serialization Format

The WES API should provide required endpoints to describe a supported workflow in a machine readable way, which could then be used by an end user to encode inputs in an expected format (and to extract outputs in the expected format as well).

This pull request already captures the basic premise of this proposal. Essentially, we need to be able to represent the workflow inputs, outputs, and any other information (engine params) in some sort of schema. The PR suggests the use of JSON Schema, which makes sense if we decide to choose JSON as the common serialization format; however, other formats could be used (Avro?).

By allowing each engine to be self describing, we provide a substantial amount of flexibility to the implementors. Differences in how each engine represents things like keys, files etc become less important when you have a schema defining all of the options available. Additionally, tooling could easily be made to understand how to interact with the schema for any workflow or workflow language.
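To make the idea concrete, here is a minimal sketch of how a client could use a published description to check inputs before submitting them. The schema content and the `check_inputs` helper are hypothetical; the sketch borrows a few JSON Schema keywords (`required`, `properties`, `type`) but is not a full JSON Schema validator.

```python
# Hypothetical description a WES server could publish for one workflow.
WORKFLOW_INPUT_SCHEMA = {
    "required": ["sample_id", "reads"],
    "properties": {
        "sample_id": {"type": "string"},
        "reads": {"type": "array"},
        "threads": {"type": "integer"},
    },
}

# Mapping from the type names used above to Python types.
PYTHON_TYPES = {
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
    "array": list,
    "object": dict,
}


def check_inputs(schema, inputs):
    """Return a list of problems; an empty list means the inputs fit the schema."""
    problems = []
    for name in schema.get("required", []):
        if name not in inputs:
            problems.append(f"missing required input: {name}")
    for name, value in inputs.items():
        declared = schema.get("properties", {}).get(name)
        if declared is None:
            problems.append(f"unknown input: {name}")
            continue
        type_name = declared["type"]
        # bool is a subclass of int in Python, so handle it separately.
        is_bool_mismatch = isinstance(value, bool) and type_name != "boolean"
        if is_bool_mismatch or not isinstance(value, PYTHON_TYPES[type_name]):
            problems.append(f"{name}: expected {type_name}")
    return problems
```

With something like this, the same client code could validate inputs for any workflow or language, as long as the server publishes the description.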

Pros

Related Issues

adamnovak commented 2 years ago

We're running into some issues at UCSC related to this problem in the wild, with Dockstore, WDL workflows, and different WES endpoints that purport to run them.

WDL explicitly takes its input as a JSON object, so some runners just directly treat workflow_params as that input JSON object and feed it to the workflow. Other runner implementations treat workflow_params as a runner-specific structured object describing where to find the JSON input file that the workflow takes, usually from among the request's attachments. (This is useful if the input is going to be large, for example.) Just knowing that you have a WES endpoint that runs WDL isn't enough here; the client has to understand whether and how to wrap or attach its input JSON object to pass it through workflow_params to the workflow.

If the spec wasn't silent on this, we could write one client and have it work the same way against everything that runs WDL on WES.

patmagee commented 2 years ago

@adamnovak thanks for the input! I was not 100% sure if we were the only ones running into this issue, so it is nice to hear that others are experiencing the same pain.

Do you have any suggestions? I too have seen the workflow_params used to refer to the actual workflow input file as well. One concern I have with that is it really does not solve the problem, i.e. you still need to figure out how the inputs are defined, which is out of scope for the API. It also means that any UI or tooling around this sees a reference to the inputs as opposed to the actual inputs which were used.

ohofmann commented 2 years ago

I tend to be more in favor of the first proposal - define the inputs / outputs now and make the adjustments instead of creating another model / framework to describe them (which we then need to standardize). From the Australian Genomics perspective I'd rather try to gather the most common input/output types for the most used engines/languages and come to an agreement for those. With that in place future workflow languages also have a template to work with if they want to support WES.

coverbeck commented 2 years ago

Speaking from a Dockstore, WES invoker, perspective, having the language-specific workflow_params makes it easy to move from running locally to running via WES:

To run/test your workflow locally with Dockstore CLI:

dockstore workflow launch --entry wdlworkflow --json localTestWdlInputs.json...

Now when you want to scale up and run it on a WES server:

dockstore workflow launch *wes* --entry wdlworkflow --json seriousWdlinputs.json...

But I suppose our CLI could transform wdlinputs.json to the new format...

This is just one data point, and I do think it makes sense to have a standard spec for inputs/outputs, but I did want to note one of the advantages of the current definition for us.

patmagee commented 2 years ago

@coverbeck does the Dockstore tooling generate the inputs file or the scaffolding for an inputs file? I find trying to build UIs around WES a challenge without some sort of way to describe the inputs that a workflow actually accepts.

coverbeck commented 2 years ago

@patmagee , yes, the Dockstore CLI has an option to generate the scaffolding for an inputs file. We also encourage authors to include their parameter file(s) when they publish their workflows, although that's more for reproducibility and/or verification.

patmagee commented 2 years ago

@coverbeck how do you handle different WES APIs which use the same workflow language but have different models for defining inputs and returning outputs?

coverbeck commented 2 years ago

@patmagee , adding @Richard-Hansen who had done most of the hands-on work for Dockstore CLI and WES.

how do you handle different WES APIs which use the same workflow language but have different models for defining inputs and returning outputs

I don't think we've really run into this yet, although we're just getting started with our first non-beta release soon -- the WDL WES implementations take a "standard" input JSON. Amazon Genomics Client is slightly different, where you need the workflow params to refer to the input file, but the input file itself is standard WDL input.

Richard-Hansen commented 2 years ago

@patmagee The Dockstore CLI doesn't have any custom infrastructure for handling special input/output scenarios. As discussed above, Dockstore has encountered two separate methods for specifying an input file input.json:

  1. Method 1, the input JSON is passed under workflow_params:
      dockstore workflow wes launch --entry myDockstoreEntry --json input.json
  2. Method 2, a JSON referencing the workflow input file is passed under workflow_params, and the actual input file is an attachment included in workflow_attachments:
      dockstore workflow wes launch --entry myDockstoreEntry --json pointer.json --attach input.json

    where pointer.json specifies which attached file is the input JSON for your workflow. pointer.json may look like:

      {"workflowInputs": "input.json"}

From the standpoint of the Dockstore CLI, no special steps are taken to support either approach. The Dockstore CLI will build an HTTP request which follows the WES schema using the provided input parameters, and it's up to the user to know what the WES server is expecting.
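For illustration, the two methods above could be sketched in Python as follows. Only `workflow_params` and the `{"workflowInputs": ...}` pointer shape come from this thread; the `attachments` key and both function names are made up for the example and are not WES field names.

```python
import json


def direct_request(inputs):
    """Method 1: the input JSON itself is the value of workflow_params."""
    return {"workflow_params": json.dumps(inputs), "attachments": {}}


def pointer_request(inputs, filename="input.json"):
    """Method 2: workflow_params points at an attached file holding the inputs."""
    return {
        "workflow_params": json.dumps({"workflowInputs": filename}),
        # Files that would be uploaded alongside the request, keyed by file name.
        "attachments": {filename: json.dumps(inputs)},
    }


inputs = {"hello.name": "world"}  # hypothetical WDL-style inputs
method_1 = direct_request(inputs)
method_2 = pointer_request(inputs)
```

Both requests carry the same workflow inputs, but a client cannot tell from the WES API alone which of the two shapes a given server expects.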


Perhaps a broader knowledge of different workflow languages would benefit me here. Would you be able to provide an example where a standard input JSON is insufficient?

My initial inclination is towards a more abstract approach to input/output formatting.

patmagee commented 1 year ago

@coverbeck that is a good example of a different approach. The AGC API requires a specific format for the workflow_params that is not easily communicable through the WES API alone. Once you know what it requires, it is quite easy of course. The way they solved it is also a potential solution here to side-step the "Enforce JSON Inputs everywhere" problem.

@Richard-Hansen So far, the languages I have encountered all have a relatively static configuration for how workflow inputs are defined. There are some caveats of course. For example, MiniWDL and Cromwell have slightly different input semantics for WDL (although I think MiniWDL allows for the Cromwell semantics).

If the workflow_params became an object that points to the workflow inputs instead of the actual workflow inputs (and of course there was a requirement in WES to maintain a reference to the actual inputs via an API), we could just rely on the languages then.

patmagee commented 1 year ago

@vinjana I would appreciate your take on this challenge

vinjana commented 1 year ago

Inputs

First concerning the input formats:

I think the current restriction of the format of workflow_params to JSON should be dropped.

There are just too many formats (YAML, JSON, TSV, CSV, XML, CLI parameters, INI, custom formats, etc.). Enforcing JSON via the standard puts unnecessary load on the WES implementors, but also on the WES clients, because they may have to convert their existing configuration files to JSON and understand how the server translates it back into something useful. This is very error-prone!

Instead, there should be some workflow_params_format field. In the ServiceInfo this field may list the allowed or the server-checked file formats. The list may also include e.g. * to represent that clients can just upload whatever they want (at their own risk). It could be optional, meaning that all formats are accepted (mostly unchecked, probably).
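A client-side check against such a field might look like the sketch below. The `workflow_params_format` field does not exist in the current specification; the function name and the example ServiceInfo dicts are assumptions made only to illustrate the three cases (explicit list, wildcard, field omitted).

```python
def format_accepted(service_info, fmt):
    """Decide whether a server would accept workflow_params in the given format."""
    formats = service_info.get("workflow_params_format")
    if formats is None:
        # Field omitted: all formats are accepted, probably unchecked.
        return True
    # "*" means clients may upload whatever they want, at their own risk.
    return "*" in formats or fmt in formats


strict_server = {"workflow_params_format": ["json", "yaml"]}  # hypothetical ServiceInfo
open_server = {"workflow_params_format": ["json", "*"]}       # hypothetical ServiceInfo
silent_server = {}                                            # field not reported
```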

However, file formats are only a part of the story. In our discussions at the plenary, it seemed to be combined with the question of parameter "types". I think that the term "parameter types" will only hide the complexity of the problem. Different file formats may be unambiguous representations of different schemas. I think that schemas will be necessary to describe the allowed parameter types for each individual workflow or even workflow version (think of a mandatory parameter that becomes optional in a new workflow version, or think of new nesting structures for parameters).

In the end, it is the responsibility of the workflow engine authors and the workflow authors to come up with a standard for this, because the authors will have to write the schemas, and the engine authors will want to support them in doing so. The schemas may then end up in TRS -- completely independent of any WES. Therefore, I think, that we should not work on providing a way to describe workflow parameters (i.e. "type systems", or "schema definition languages", etc.).

However, like for the file formats, it may be interesting to check the uploaded configurations against schemas. The WES operator may want to support the client and provide "schema checks as a service" for uploaded workflows. In this case, there should be a workflow_params_schema (or so) field in the RunRequest. Alternatively, the WES operator may want to enforce specific schemas for managed workflows. For this it would be helpful if the allowed schema were reported to the client, e.g. via a route that describes the managed workflows.

But for these use cases, I think that the communication of workflow schemas via the API (as input for RunRequest, or output for managed workflows) is actually just a nice-to-have feature. So I think it would be perfectly fine if a WES just read the schema from the workflow and reported validation errors if any occur.

At the moment, I don't see any reason to make file formats or schemas a requirement for a useful WES implementation. Therefore, I would suggest that the fields related to formats and schemas should be optional features in the WES specification.

Outputs

One may think of e.g. reporting file formats/schemas for each file. A use case mentioned in the plenary would be to automatically choose visualisation components.

But also for this, I think that this is ultimately the responsibility of the workflow author to annotate the outputs with the right formats. I am not familiar with the ro-crate specification. Maybe that would be a good way to communicate this information from the workflow to the WES server.

For the RunLog we may think of annotating each output file with a format/schema specification. I don't know, maybe allow to add a list of ontology terms or so. This would also be useful to e.g. describe the semantics of the file. Usually the visual representation of a file depends not only on the file format/schema, but also on the semantics.
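One possible shape for such an annotation, sketched in Python: every name here (the `AnnotatedOutput` class, its fields, the viewer table, and the placeholder ontology term) is hypothetical and only illustrates how a client might pick a visualisation component from the annotated format.

```python
from dataclasses import dataclass, field


@dataclass
class AnnotatedOutput:
    """A hypothetical output entry carrying format and semantic annotations."""
    location: str
    file_format: str
    terms: list = field(default_factory=list)  # ontology terms describing semantics


# Hypothetical mapping from file format to a visualisation component.
VIEWERS = {"vcf": "variant-table", "png": "image-viewer"}


def choose_viewer(output):
    """Pick a viewer by file format, falling back to a plain download link."""
    return VIEWERS.get(output.file_format, "download-link")


calls = AnnotatedOutput("s3://bucket/calls.vcf", "vcf", ["example:variant-calls"])
report = AnnotatedOutput("s3://bucket/report.xyz", "xyz")
```

A real client would likely also consult the ontology terms, since the right view depends on the semantics and not only on the format.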

mr-c commented 1 year ago

I'm strongly against not having a standardized schema for inputs, as it makes having a generic WES client impossible.

If the WES client is tied to the workflow language/framework, then what is the point of the standardized API? All the pain of a standard with no benefit.

vinjana commented 1 year ago

@mr-c

I'm unsure what you mean by "standardized schema". Do you mean the format, such as JSON? Or do you mean like a JSON schema?

My guess is that you want whatever format is used to be validatable, so it should have some schema and the config should be in a format like JSON, YAML, etc. You are probably just against making the enforcement of a schema optional, or not? In that case it would be fine to e.g. download/upload a schema file with the workflow.

vinjana commented 1 year ago

My comment assumed that the WES API would enforce as little from the workflow engines and workflows as possible. Therefore I tried to find a solution that satisfies this assumption.

If no format or schema is enforced at all for the workflow_params, this probably would be so for all instances of that workflow or engine also in other WES implementations. Then there would also not be any problem to execute this workflow on one WES or the other.

Except, of course, some WES implementors are not content with this and implement some mapping specifically for this engine or workflow. This would be possible, but lead to the problem that there is no guarantee that all WES implementors that support that engine do the mapping in the same way. And then the clients would have to know details about the WES implementation, which is against the idea of a standard API, right?

This problem, BTW, exists also if the standard enforces some schema for the workflow_params. I fear, it can only be prevented if it would be decided, that only workflows and engines are supported whose parameters can be unambiguously represented by the fixed schema. And the stricter the schema, the more workflows and engines will be excluded or will have to be changed.

A solution with arbitrary schemas would exclude very few workflows/engines, and does not prevent anybody from implementing a simple flat key/value/type schema for a workflow engine that does not support more.

uniqueg commented 1 year ago

I think there is clearly a need for describing inputs and outputs in a standardized way, but I also feel that this should not be handled in WES directly, but rather should be a different specification (possibly also in the Cloud WS). Talking to various people over the past months (within and outside of GA4GH), I had the feeling that there is broad agreement on a need for this, and I guess what is really needed is a task force of multiple motivated people trying to come up with a first version. Possibly based on schemas that are already being used in the wild, like the one from nf-core.

WES implementations could then be encouraged or required to parse/interpret this schema and a new file type IO_SCHEMA (or similar) could be added to TRS to be associated with a workflow. Ideally, of course, the schema format should be promoted as good practice by the various workflow language developers (and possibly support for it added natively in the respective engines), so while it is not absolutely essential to have those developers on board, it would be tremendously helpful to make such a schema a success.

Re: your concern of restricting the workflows that can be seamlessly run on WES (and different WES instances/implementations, no less), @vinjana: such a schema could of course be provided by anyone; it does not have to be written by the workflow devs themselves. NF Tower, for example, allows uploading schemas if they are not already part of the workflow repository (or override existing ones). Considering that it takes quite a few requirements to write a workflow that is truly portable and executable on WES, I don't think having to provide that schema is that much of a hurdle. Plus, the fallback of providing inputs via a JSON string as is currently the case is always there (although with a significantly worse user experience).

Tagging a couple of people here to see if we can come up with a critical mass to get this going: @mr-c @johanneskoester @pditommaso @simleo @KerstenBreuer @bgruening @adamnovak @tetron @denis-yuen @coverbeck

tetron commented 1 year ago

I agree with this, I proposed exactly something like this for WES several years ago, unfortunately I don't have the presentation from then. It didn't go anywhere, people didn't seem to understand the need.

However, the gist of the proposal was to start from the CWL model, which is already rigorously defined and abstracted from the runtime details:

https://www.commonwl.org/v1.2/Workflow.html#Operation

The CWL input/output schema definition itself descends from Avro, although that isn't particularly relevant here.

It's my understanding that Workflow Hub (https://workflowhub.eu/) intends to use CWL as the internal common abstract model for inputs/outputs/steps, @stain would know more.

uniqueg commented 1 year ago

Great! Starting from the CWL model seems perfectly reasonable to me. We can then see what other solutions could possibly add to this, e.g., nf-core or Snakemake. @KerstenBreuer had also started collecting requirements for such an Input/Output description, which I thought was quite complete. Do you maybe want to share this here?

patmagee commented 1 year ago

@tetron IIRC the CWL team was working on a WDL -> CWL translator? Did you encounter any hiccups with this? Do you know if that type of approach would be conducive to Nextflow or Snakemake?

uniqueg commented 1 year ago

I'm not sure, but I think there is also some sort of internal conversion of workflows in various languages (possibly including NFL and SMK) to CWL happening in the Workflow Hub. IIRC, this is just for internal representation, so engines may not be able to run the resulting CWL representations, but the tooling may still be a good starting point.

I'm sure @stain or @fbacall would know more. That is, unless I'm completely mistaken :)

patmagee commented 1 year ago

I wonder if there has been any thought about reusing other GA4GH initiatives such as SchemaBlocks to describe the data representation of inputs. Remember, it's not necessarily the case that we need to be able to have a single meta model of allowed input types, so long as there is a single WAY to describe things using a shared ontology.

mr-c commented 1 year ago

@tetron IIRC the CWL team was working on a WDL -> CWL translator? Did you encounter any hiccups with this? Do you know if that type of approach would be conducive to Nextflow or Snakemake?

Here is a crosswalk of CWL types to WDL types: https://github.com/dnanexus/dxCompiler/blob/main/doc/CWL_v1.2.0_to_WDL_v1.md#cwl-to-wdl-type-mappings