ga4gh / workflow-execution-service-schemas

The WES API is a standard way to run and manage portable workflows.

Standardizing the `workflow_params` format #161

Open mr-c opened 3 years ago

mr-c commented 3 years ago

During the "Cloud Work Stream @ GA4GH 8th Plenary 2020" meeting, it was proposed that the `workflow_params` object format be standardized.

I proposed reusing the CWL format, where it is called an "input object".

During the same meeting a participant asked how the WDL input format compares to the CWL input format. I can now share a document that maps CWL concepts to WDL concepts: https://github.com/dnanexus/dxWDL/blob/master/doc/CWL_v1.2.0_to_WDL_v1.md

If this is agreeable, then I would be happy to write up a self-contained description of the CWL input object format. It is essentially a JSON dictionary mapping input names to values. The biggest difference is that File inputs are explicitly marked as such rather than being plain strings.
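As a rough sketch (with invented input names and file path), an input object for a workflow taking a string, an integer, and a file might look like:

{
  "sample_name": "NA12878",
  "min_quality": 30,
  "reads": {
    "class": "File",
    "path": "reads.fastq.gz"
  }
}

The same object can equally be written as YAML; the key difference from a plain key-value map is that File (and Directory) values are objects with a `class` field rather than bare strings.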

jaeddy commented 3 years ago

Thanks, @mr-c! The document you authored mapping CWL to WDL concepts is fantastic. Maybe @pditommaso can enlist you to flesh out a similar mapping for Nextflow concepts. 🙂 (and if not, it might still be worth exploring other ways to compensate your effort)

One minor concern I have about the CWL input object being the common format is the emphasis on some Avro types over JSON Schema (e.g., record vs. object). While it's hard to complain about any Apache standard, I'm not sure which format is more common in real-world applications. In particular, GA4GH has leaned heavily into OpenAPI, which uses JSON Schema to schematize components, parameters, etc. I don't know if this could lead to tension between API model specification and workflow parameter specification.

On the other hand, JSON Schema might be lacking when it comes to semantics (compared to JSON-LD, SALAD, et al.), which seems pretty valuable in the world of clinical and genomic data operations — I don't know if Avro in particular offers any additional benefits in this regard.

I need to spend more time digesting the doc you shared, but I'm curious to hear your take on some of these JSON-world distinctions.

ps - I don't know whether this thread should be tied to WES or a separate Cloud WS standard altogether.

mr-c commented 3 years ago

Thank you @jaeddy for the enthusiasm! Credit goes to DNAnexus for sponsoring that document.

Yes, it would be great to see other mappings laid out side by side.

FYI, CWL is no longer formally bound to Avro, though historically there was a connection (which was an attempt to synchronize with what GA4GH was doing at the time, if I remember correctly).

The CWL standards are living standards, so refinements, clarifications, or even an additional format for specifying inputs are possible in future versions.

In fact, the CWL standard does not require the CWL YAML/JSON format for inputs; they could be specified via a GUI, XML, a binary stream, or whatever else the CWL-aware workflow engine accepts. However, the cwl-runner interface, which is used for command-line execution of the CWL conformance tests (and by many users), does require support for the CWL JSON/YAML input object format.
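For example (the file names here are just placeholders), the conventional cwl-runner invocation passes the workflow and the input object file on the command line:

$ cwl-runner my-workflow.cwl my-inputs.yml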

@tetron may have additional insights into the benefits of the CWL/SALAD approach.

> ps - I don't know whether this thread should be tied to WES or a separate Cloud WS standard altogether.

Should we move this discussion to a mailing list, or is this issue sufficient for now?

pditommaso commented 3 years ago

Thanks for including me in this thread. From my point of view, the WES API should be agnostic to any specific params format or semantics to accommodate the needs of any backend.

With regard to Nextflow's requirements, `workflow_params` is used both for the pipeline inputs and for the Nextflow runtime parameters and metadata.

If you are inclined to enforce a format for the pipeline inputs, IMO it should be an opt-in feature provided by a new attribute, e.g. `workflow_inputs`. This would also maintain backward compatibility with the existing WES API definition and implementations.
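As a rough sketch of that idea (the `workflow_inputs` attribute below is only the proposed addition, and the URL and values are invented), `workflow_params` could stay engine-specific while the new attribute carries the standardized inputs:

{
  "workflow_url": "https://example.org/pipelines/main.nf",
  "workflow_params": {
    "outdir": "results",
    "max_cpus": 16
  },
  "workflow_inputs": {
    "reads": {
      "class": "File",
      "location": "https://example.org/data/reads.fastq.gz"
    }
  }
}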

nsheff commented 3 years ago

Has there been any update on the effort to standardize `workflow_params`?

mr-c commented 3 years ago

@nsheff It isn't part of any of my funded work, so nothing from my side.

ruchim commented 3 years ago

@nsheff thanks for raising that question. Also adding @mr-c for input.

I'm thinking through the differences between engines that implement CWL, WDL, and Nextflow, since workflow parameters include the workflow script/descriptor, inputs, and a range of metadata/runtime attributes that vary widely. My primary thought is that this would enforce a lot of standardization on engine implementations, so it should be balanced against high value for WES users.

Can we talk through scenarios that would be significantly improved by this standardization, or that would not be possible without it?

Thanks!

patmagee commented 3 years ago

@mr-c the proposal to use the CWL standard seems reasonable and should more or less map onto what WDL uses (key-value pairs). The only thing I am hesitant about is the explicit inclusion of the `class: "File"` component for defining files. Looking through the WDL and Nextflow docs, it does not appear that either of these requires typing the file inputs as part of the inputs themselves. Every workflow language will likely have different constraints; will we include the idiosyncrasies of each, and not just CWL, in order to ensure cross-compatibility?

I wonder if a more restrictive input format of just key-value pairs would work, where all we enforce is that the key is an input attribute, runtime attribute, or other parameter required by the workflow, and the value is the JSON value that parameter needs. This wouldn't enforce the CWL semantics, but it would provide a standard way to define at least the keys of the inputs. Clients of the WES API would have to be language-aware (i.e., they would likely need to know whether they are submitting WDL, CWL, or Nextflow), and could supply the specific value format dictated by that language.
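As a rough sketch of that idea (names and URIs invented), the same file input would then be expressed in whatever value format the target language expects, with only the key conventions shared.

For a WDL workflow:

{
  "foo.reads": "gs://my-bucket/reads.fastq.gz"
}

For a CWL workflow:

{
  "reads": {
    "class": "File",
    "location": "gs://my-bucket/reads.fastq.gz"
  }
}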

patmagee commented 3 years ago

Another issue with the CWL syntax for files, with regard to WDL specifically, is that the only reliable way to strip the class information is to introspect and parse the WDL document to get the input types. For example, take the following two workflows:

workflow foo {
  input {
    File inp
  }
}

workflow foo {
  input {
    Map[String,String] inp
  }
}

Both of these will take the following input:

{
  "foo.inp": {
     "class": "File",
     "path" : "my-file.txt"
  }
}

However, if we try to apply a general heuristic of stripping out objects containing class: File with a path and replacing them with just a path, the second workflow will not work. The only way to properly modify the inputs is to actually parse the full workflow, get the types of each input, descend into the JSON tree, identify any File-typed locations, and handle them properly. This is potentially error-prone, and not every WES API (especially one that is not itself the engine but delegates to somewhere else) will have this capability.
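For illustration, that stripping heuristic would turn the input above into:

{
  "foo.inp": "my-file.txt"
}

which is what the File version of the workflow expects, but which breaks the Map[String,String] version, where the original object may well have been intended as literal map data (keys "class" and "path").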

This extra parsing is, IMO, an unfair burden on non-CWL implementations (which will be the majority) and will likely be a blocker for adoption.

tetron commented 3 years ago

@patmagee

> However, if we try to apply a general heuristic of stripping out objects containing class: File with a path and replacing them with just a path, the second workflow will not work. The only way to properly modify the inputs is to actually parse the full workflow, get the types of each input, descend into the JSON tree, identify any File-typed locations, and handle them properly. This is potentially error-prone, and not every WES API (especially one that is not itself the engine but delegates to somewhere else) will have this capability.

I think you have it backwards.

Without some way of indicating that something is a file (e.g., the File object), it is impossible to distinguish between plain strings and files WITHOUT parsing the full workflow and understanding the input types.

It seems like it needs to be one of:

  1. Use a convention like the CWL File object, which can be manipulated in well-defined ways to convert to the workflow's native parameter input.
  2. Use a plain string for files, but require it to follow some constrained format that ensures it can be reliably recognized as a URI.
  3. Standardize on a lightweight schema that describes input parameter types, which would be extracted from the workflow language's native way of describing the parameters, so that you can infer which parameters are files and which are bare strings, and which can be used uniformly with CWL/WDL/Nextflow/etc. CWL also provides a corresponding schema language for describing input parameters, but maybe you'd prefer something else.
  4. Decide that distinguishing between files, strings, etc. for WES parameters is too hard / doesn't matter, and close the ticket.

I don't know how big of a problem it is in practice that you would have a Map[String,String] that looked like a File object and was converted to a bare path when you didn't want that to happen. It is a fair objection but also something of a pathological case. In CWL, unbounded mappings like that are pretty unusual; you can declare record-type inputs, but you have to declare what fields you are expecting (and File objects stay as objects in the data model).
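For concreteness, here is a rough sketch (invented input name and URI) of what options 1 and 2 would look like for the same file input.

Option 1, a CWL-style File object:

{
  "reads": {
    "class": "File",
    "location": "https://example.org/data/sample1.fastq.gz"
  }
}

Option 2, a bare string constrained to be recognizable as a URI:

{
  "reads": "https://example.org/data/sample1.fastq.gz"
}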

patmagee commented 3 years ago

@tetron what I am saying is that the proposal to use the CWL approach will add an ADDITIONAL parsing step for the workflow (unless someone is building a new execution engine from the ground up).

Of course, in order to execute a workflow, any engine will need to parse the provided files and inputs at runtime; however, WES implementations will likely be added onto existing infrastructure. This means they will mostly translate an incoming WES call into the native call that is supported.

It seems unnecessary to impose additional parsing here for all non-CWL workflows just to support the CWL syntax, especially when all of the typing information will be readily available within the runtime environment of the engine.

pgrosu commented 3 years ago

Maybe I'm doing something wrong, but the first workflow seems to fail with the above input:

$ cat path.json
{

  "foo.inp": {
     "class": "File",
     "path" : "my-file.txt"
  }

}

Below is the error:

[2020-12-09 16:10:01,71] [info] WorkflowManagerActor Workflow 29e1908b-d5f5-4e34-b5d4-ff21fb8659d5 failed (during MaterializingWorkflowDescriptorState): cromwell.engine.workflow.lifecycle.materialization.MaterializeWorkflowDescriptorActor$$anon$1: Workflow input processing failed:
Failed to evaluate input 'inp' (reason 1 of 1): No coercion defined from '{"class":"File","path":"my-file.txt"}' of type 'spray.json.JsObject' to 'File'.

I think womtool is very helpful in this situation for auto-selecting the appropriate input JSON file to use, when one has multiple input JSON files for different workflow type configurations:

$ java -jar womtool-54.jar inputs path.wdl
{
  "foo.inp": "File"
}

$ cat path.wdl

version 1.0

workflow foo {

    input {
      File inp
    }

}