common-workflow-language / common-workflow-language

Repository for the CWL standards. Use https://cwl.discourse.group/ for support 😊
https://www.commonwl.org
Apache License 2.0

proposal: inline parameters references #107

Closed yvdriess closed 8 years ago

yvdriess commented 9 years ago

Observation:

A common pattern in non-trivial CWL workflow and tool specifications is the need to refer to some workflow-, tool-, or job-level parameter. Today, the provided example workflows seem to achieve this by using the expression engine. Some examples:

outputs:
  - id: "#bam"
    type: File
    outputBinding:
      glob:
        engine: "node-engine.cwl"
        script: >
          $job.output_prefix + '.aligned.bam'
outputs:
  - id: "#bam"
    type: File
    outputBinding:
      glob:
        engine: "node-engine.cwl"
        script: >
          $job.input_name + '.aligned.bam'
stdout:
  engine: cwl:JsonPointer
  script: /job/echo-in-outputfile
outputs:
  - id: "#output_file"
    type: File
    outputBinding:
      glob:
        engine: "cwl:JsonPointer"
        script: "job/output_name"

Problem:

The most obvious (though harmless) issue with the code above is its verbosity: one must name an expression engine, provide a script, and point to an object inside that script. The special-case JsonPointer syntax for addressing an object with '/' can be confusing, but is not a problem in itself.

The problems with the above examples are twofold: separation of concerns, and the introduction of unnecessary and dangerous dependencies. From the point of view of separation of concerns, the expression engine was added to do more complex processing work 'inline'. Using this feature simply to fetch a string from a parameter is an elaborate workaround at best and a dangerous unintended use at worst. The dangerous part is that such scripts, while fulfilling a simple purpose, introduce a dependency on a specific expression engine (e.g. a bash version or ECMAScript version), a dependency that is unnecessary given what the script wants to achieve, e.g. 'give me the name of the output file'. Moreover, in the batching example the tool description relies, through the script, on '#output_prefix' existing in the context of the workflow. This is dangerous both because tools are assumed to be workflow-agnostic and because this assumption cannot be statically checked by the CWL implementation (the reference is embedded in the script code).

In short, the frequent use of expressions in CWL tool and workflow specifications is a code smell: it points to a possible missing feature, whose absence forces users to press the expression engine into service as a workaround.

Solution:

Add a simple way to refer to job-level objects in the CWL specification syntax. I propose allowing only String types to be referred to, as this seems sufficient for most of the examples observed above. Moreover, it allows simple string substitution and embedding of the string object in the CWL specification syntax. That said, there are some cases where referring to File objects would also help eliminate superfluous uses of the expression engine.

outputBinding:
  glob: "#{input_name}.aligned.bam"
outputs:
  - id: "#bam"
    type: File
    outputBinding:
      glob: "#{output_prefix}.aligned.bam"
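For illustration, the proposed `#{...}` substitution could be handled by a single preprocessing pass. This is a hypothetical sketch, not part of any CWL implementation, and the function name is invented:

```javascript
// Hypothetical sketch of the proposed "#{param}" substitution.
// `params` is assumed to be a flat map of job-level string parameters.
function substituteParams(template, params) {
  return template.replace(/#\{([A-Za-z_][A-Za-z0-9_]*)\}/g, function (match, name) {
    if (!(name in params)) {
      throw new Error("unknown parameter: " + name);
    }
    return params[name];
  });
}

substituteParams("#{output_prefix}.aligned.bam", { output_prefix: "sample1" });
// -> "sample1.aligned.bam"
```

Because the substitution is a plain regex pass, an implementation could also statically check that every referenced parameter exists, which the script-based form cannot offer.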

NB. Please suggest alternatives and point out mistakes in the above!

dleehr commented 9 years ago

I agree that the current state is very verbose and should be more concise for simple use cases. I think variable substitution/preprocessing is a good approach for the simple cases, leaving the more complicated ones (e.g. changing extensions) to an expression/scripting engine.

Another idea would be to enumerate the most common cases (assembling a file name or string, using an input string for an output name) and support these directly. Just as a command line is assembled from inputs, outputs, and baseCommand, a file name could be assembled from a similar tree.

In a similar project, I defined a handful of custom YAML tags that do just this. While this approach isn't aligned with the intent of YAML tags (serializing custom data structures), the four operations (join, change_ext, base, and interpret_var) have covered my use cases so far.

tetron commented 9 years ago

I am not opposed to this in principle, but I want to be very careful about the slippery slope. If we add support for variable substitution, someone will want array indexing, arithmetic, macros, functions, classes, etc. Each feature raises the burden of implementing CWL, and one of the goals is to make things as easy as possible for platform developers (because CWL won't be portable if no one implements it).

Another suggestion that had come up is to remove expressions entirely, except for input references (and/or a substitution syntax like the one proposed). Complex uses of expressions might be better implemented as upstream work that feeds into the "real" job.

tetron commented 9 years ago

As a side note, YAML tags are a little esoteric (they are definitely not part of the JSON-compatible subset of YAML), but they do map somewhat onto the RDF notion of extensible data types. We could use tags to indicate that a string should be evaluated as an expression; this may be less verbose than the current notation.

tetron commented 9 years ago

Here is an example of the slippery slope. Concatenating strings is fine and easy, but what if you want to strip off a file extension and replace it with another? Now you need string-manipulation functions like "substr" or "basename", or the substitution syntax is too weak to even handle its intended use cases.

yvdriess commented 9 years ago

I recognise the possible slippery slope. However, there is a need for some light-weight manipulations at the specification level. As was observed, a simple substitution might already cover the majority of the cases. Another suggestion I heard was to use inline bash within the CWL specification. This would also solve the issue without having to worry about balancing implementation work against the expressiveness of a custom syntax. What's important then is to scope the bash expressions such that they are only used to form String datatypes, and that after preprocessing the YAML/JSON-LD, none of the bash remains in the flat JSON CWL workflow file. This way, execution engines (e.g. precipes) can focus on a flat CWL specification and leave the preprocessing to more complete implementations (e.g. the reference implementation or Rabix).

I can get behind the suggestion to remove expressions entirely, as it does seem that their uses are either trivial, as in the observations above, or complex enough to be treated as external/upstream work.

tetron commented 9 years ago

Much of the power of shell scripts relies on calling traditional Unix tools for text manipulation, so this would either create a huge backdoor in the system (through calling arbitrary tools) and/or still be too weak for many use cases.

One option might be to identify a very limited subset of an existing language (such as JavaScript) and specify that, such that it is tractable to implement a standalone expression interpreter or delegate to a full implementation of the language. This approach would explicitly disallow complex constructs like function definition.

tetron commented 9 years ago

After giving it more thought, here's my proposal.

Add inline/string interpolated expressions with the following forms:

  1. $(javascript expression)
  2. ${javascript block}

The above examples:

outputBinding:
  glob: "$(job.input_name).aligned.bam"
outputs:
  - id: "#bam"
    type: File
    outputBinding:
      glob: "$(job.output_prefix).aligned.bam"

The second form ${} allows statements and must end in a return statement. In this example, add the argument "--gzip" if the input file ends in ".gz":

arguments:
  - valueFrom: ${ if (job.file.path.endsWith(".gz")) { return "--gzip"; } else { return null; } }

In this example, go through the list of input files and separate the files ending in ".abc" and ".xyz"

arguments:
  - valueFrom: >
      ${
        var a = [];
        var b = [];
        for (var i = 0; i < job.files.length; i++) {
          if (job.files[i].path.endsWith(".abc")) {
            a.push(job.files[i]);
          }
          if (job.files[i].path.endsWith(".xyz")) {
            b.push(job.files[i]);
          }
        }
        return ["--abc"].concat(a, ["--xyz"], b);
      }

Proposed changes:

  1. Add InlineJavascriptRequirement to enable feature
  2. Inline expressions can be used anywhere expressions are currently specified
  3. Deprecate (or remove) pluggable expression engines

Intended benefits of this proposal:

ghost commented 9 years ago

Would inline expressions always return a string? If the expression is supposed to yield something else (array, int, etc), do we just use the regular class: Expression struct?

For example, would the following values in stdin, stdout, baseCommand, glob and valueFrom fields evaluate like this:

$(1)      -> "1"
$(null)   -> ""
$([1,2])x -> "1,2x"
$({a:1})  -> "[object Object]"
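One way to read those proposed coercions (a sketch of the rules in the comment above, assuming JavaScript's default stringification with null mapped to the empty string):

```javascript
// Sketch of the proposed string coercion for inline expression results.
function coerceToString(value) {
  if (value === null || value === undefined) {
    return "";              // $(null) -> ""
  }
  return String(value);     // JS defaults: [1,2] -> "1,2", {a:1} -> "[object Object]"
}

coerceToString(1);       // "1"
coerceToString([1, 2]);  // "1,2", so "$([1,2])x" yields "1,2x"
```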

yvdriess commented 9 years ago

+1 for the @tetron proposal.

Although, I have some questions about the 'job' object: how well specified is the context of this job object? How much of it is known 'statically' at preprocessing time? How much of it can only be known/filled in at runtime, during workflow execution?

For the sake of the discussion, imagine a CWL specification which has two types of expressions: 'static' (compile-time) and 'dynamic' (run-time) expressions. Static expressions are guaranteed to be executable during a preprocessing step. The execution environment does not see such expressions, only the resulting string (or object?). Static expressions can naturally only use elements of 'job' that are known at preprocessing time (e.g. bound workflow parameters). Dynamic expressions are those that need context/runtime information and thus cannot be expanded beforehand. For instance, the 'output_prefix' element is only known during workflow execution.

The big question here is: can all such dynamic expressions be expanded at the start of the workflow execution, i.e. the moment the CWL binary is invoked with concrete parameters? Or can dynamic expressions depend on data that is only produced by the execution of other steps in the workflow? This question is related to @tetron 's third benefit:

Deprecating or removing pluggable expression engines eliminates them as an open-ended computational backdoor in the workflow; anything that can't easily be expressed by an expression probably ought to be an upstream workflow step.

It is easy to see that static expressions cannot really serve as such backdoors or be abused as inline computational steps. Dynamic expressions that can be expanded at program start are also less open to abuse, as they are not part of the workflow dependency graph. Dynamic expressions that need context/job information only available at execution time are the ones that can be abused.

tetron commented 9 years ago

@ntijanic what do you think of this: inline expressions may be embedded in a string or not, and that controls whether the return value is coerced into a string or keeps its return type.

foo: $(1+1) 
 ->
foo: 2 (number)

foo: " $(1+1) "
 ->
foo: " 2 " (string)

foo: $(String(1+1))
 ->
foo: "2" (string)
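A minimal sketch of that rule (using a bare eval purely for illustration; a real implementation would sandbox evaluation and handle nested parentheses):

```javascript
// If the whole string is a single $() expression, keep the native type;
// otherwise splice stringified values into the surrounding text.
function interpolate(str) {
  var lone = str.match(/^\$\((.*)\)$/);
  if (lone) {
    return eval("(" + lone[1] + ")");
  }
  return str.replace(/\$\(([^)]*)\)/g, function (_, expr) {
    return String(eval("(" + expr + ")"));
  });
}

interpolate("$(1+1)");    // 2 (number)
interpolate(" $(1+1) ");  // " 2 " (string)
```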

tetron commented 9 years ago

@yvdriess I'm not entirely sure I understand the distinction between "static" and "dynamic" expressions. The reason for expressions is that some decisions are late-binding and can't be made until the workflow is already running. You're correct that there's a subclass of expressions that can be computed immediately without actually executing any jobs, such as deterministically generating output names from input names. That would be useful for "make"-style workflows, where you specify the desired outcome (build program X) and follow the graph backwards so as to perform only the necessary work, instead of imperatively executing the entire workflow from the initial input. This is an interesting question that we haven't fully addressed.

ghost commented 9 years ago

inline expressions may be embedded in a string or not, and that controls whether the return value is coerced into a string or keeps its return type.

Trying to understand this, but failing miserably :)

Kind of by definition, inline expressions are embedded in a string. Can you give a real example of a non-embedded expression?

It doesn't seem like much overhead to write e.g. {class: JSExpression, expr: 1+2} if someone really wants an integer result. Most of the time, this will be used as a string anyway (command-line fragments, globs, file names, etc.), so the equivalent $(1+2) can be used.

For most use cases, I think it makes sense to add two more rules:

For example, as discussed briefly on gitter:

inputs:
 - id: inp
   type: ["null", "boolean"]
   inputBinding:
     prefix: -p
     valueFrom: $($self===null?null:($self?1:0))  # [null, true, false] -> [null, 1, 0]

The above would add "-p 1" or "-p 0" or nothing, depending on input value.
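Written as an ordinary function (with $self as a plain argument, for illustration only), the ternary maps the three possible input values like this:

```javascript
// [null, true, false] -> [null, 1, 0], as in the valueFrom above.
function mapFlag(self) {
  return self === null ? null : (self ? 1 : 0);
}

mapFlag(null);   // null -> the -p prefix is omitted entirely
mapFlag(true);   // 1    -> "-p 1"
mapFlag(false);  // 0    -> "-p 0"
```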

ghost commented 9 years ago

@tetron Perhaps you mean something like "If a string contains only an embedded expression, with no leading/trailing characters, then the result is not coerced"?

This might work, but not sure if it's needed at all. Also, it would be good to use an existing templating syntax so we don't have to document yet another one. And if we do, the extra rules around it might make it hard to use existing libs for the templating engine.

tetron commented 9 years ago

@ntijanic yes that's what I was trying to say. If there are no leading/trailing characters, then do not coerce the result to string (this is pretty easy because YAML syntax tends to trim leading/trailing whitespace). The reasoning for proposing this rule is to be able to kill off the more verbose expression syntax once and for all.

When you say "template engine", do you mean templates like this?

<% for (var i = 0; i < 5; i++) { %>
  Hello world!
<% } %>

I'm not proposing general templates. My idea for inline expressions is for simpler string interpolation/substitution semantics. This would not include the above example.

The distinction between $() and ${} is whether it is an expression or a block:

$(1+1)           -> function(){ return (1+1); }()
${ return 1+1; } -> function(){ return 1+1; }()

So for the above you would write something like:

${
  var s = ""; 
  for (var i = 0; i < 5; i++) {
    s += "Hello world!\n";
  }
  return s;
}
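The two desugarings can be sketched directly (eval is used only to make the illustration runnable, and the helper names are invented):

```javascript
// $(expr)   -> (function(){ return (expr); })()
// ${ body } -> (function(){ body })()
function evalExpression(expr) {
  return eval("(function(){ return (" + expr + "); })()");
}

function evalBlock(body) {
  return eval("(function(){ " + body + " })()");
}

evalExpression("1+1");  // 2
evalBlock("var s = ''; for (var i = 0; i < 3; i++) { s += 'x'; } return s;");  // "xxx"
```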

What do you think?

ghost commented 9 years ago

Yeah, I mostly had in mind underscore's templating, which is basically just interpolation with options to html-escape the result. We don't need the html-escape, but might want to have shell-escape if we add the shell feature.

I like the general idea a lot. If we make the expression syntax part of the core spec, we can do away with json pointers too. I'm just not sure about $(), ${} and coercion rules.

tetron commented 9 years ago

Request for comments on branch inline-expressions

tetron commented 9 years ago

So, having implemented inline expressions, I'm looking back at @yvdriess 's original request, which was just for inline parameters and not full expressions. I think it would make sense to have a distinct feature with the same behavior as full inline expressions but limited to JavaScript value references; basically a small compatible subset of the proposed inline expressions. The following examples are all equivalent:

requirements:
 - class: InlineParameterRequirement

$($job.file.path) 
$($job['file'].path)
$($job.file['path'])
$($job['file']['path'])

It should be straightforward to match inline parameters using just a regular expression, because nested parens/braces/brackets are not a concern. Example (untested, certainly incomplete):

$\($job(\.[a-zA-Z][a-zA-Z0-9]*|\['[^']+'\])+\)
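As written, the draft pattern's leading `$` is unescaped (a regex would read it as an end-of-string anchor). With that one character escaped, it can be smoke-tested, though it remains incomplete as noted above:

```javascript
// Draft inline-parameter matcher with the leading "$" escaped.
var inlineParam = /\$\(\$job(\.[a-zA-Z][a-zA-Z0-9]*|\['[^']+'\])+\)/;

inlineParam.test("$($job.file.path)");        // true
inlineParam.test("$($job['file']['path'])");  // true
inlineParam.test("$(1+1)");                   // false
```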

tetron commented 8 years ago

Merged.