common-workflow-language / common-workflow-language

Repository for the CWL standards. Use https://cwl.discourse.group/ for support 😊
https://www.commonwl.org
Apache License 2.0
1.45k stars 198 forks source link

input value restrictions / validations #764

Open mr-c opened 6 years ago

mr-c commented 6 years ago

CWL restrictions

In this Biostars question, we started discussing an extension of the CWL specification (inspired by CTD) to allow restricting the values that an input parameter can take. After discussing among a number of interested people, we would now like to further discuss this issue publically to get feedback and arrive at a consensus before settling on the syntax and implementing reference validation routines.

Contributors

Alexander Kanitz Julianus Pfeuffer Timo Sachsenberg Rene Rahn Michael R. Crusoe Peter Amstutz

More contributors are very welcome, please add yourself to the list!

First of all, what is the motivation here?

The ability to specify which values the parameter of Command Line Tools (or Workflows) can take improves tool handling and documentation in several regards.

It improves documentation of tool regarding sensible parameter ranges. It opens the way for automated GUI generation for worfklow/tool configuration (e.g., drop down boxes to select one/multiple valid values, combo boxes etc.). It allows for validation of arguments before runtime (which is often delayed due to batch system queues or other runtime setup), thus (i) potentially saving time by catching errors at a very early stage with (ii) uniform and informative error messages that do not needlessly expose the interiors of the given Command Line Tool. In certain settings (eg. web server) this would further allow a reduced risk of code injection and an improvement of usability because end users could be effectively prevented from entering wrong values.

What about compatibility with current CWL tools, workflows and engines?

Specifying value restrictions is (and should probably remain) optional. Validation of values should also, at least, initially be optional and may be altogether ignored by some engines. At a later date the community could choose to make it required in a future revision of the CWL standards. In any case, parameters for which restrictions are not explicitly specified should always pass any validations.

Consumers of CWL that choose to make use of the optional restrictions feature will be encouraged to backwards propagate them for a better user experience. This means that a restriction on a input for a particular tool is implicitly also on the corresponding Workflow level input if that input is not used anywhere else. But if a workflow level input is connected to more than one tool, it has to pass the restrictions for all of the tools (e.g. restrictions are tool-, not workflow-specific).

Will there be any new types?

Probably not. Regardless of how restrictions are encoded and validated, the types associated with each parameter remain the same. However, validation (if implemented) needs to ensure that the restrictions are compatible with the type of the parameter (e.g. regular expressions can only be used with strings, and intervals can only be used with numeric types).

Which restrictions are to be supported?

string: Regular expressions, string sets(?) int/long: Integer intervals, integer sequences(?), integer sets(?) float/double: (Real) intervals, integer intervals, real sets(?)

Note that multiple validation specifications are allowed for each input. They are to be combined in an 'OR’ manner, passing one of them is good enough.

Which restrictions will not be supported (at this point)?

Complex and irrational numbers (we think this type would need to be implemented first) Complex series (in the future we could have an extension point to an external validator like series on oeis.org)

What about other types?

http://www.commonwl.org/v1.0/CommandLineTool.html#CWLType http://www.commonwl.org/v1.0/CommandLineTool.html#CommandInputRecordSchema http://www.commonwl.org/v1.0/CommandLineTool.html#CommandInputArraySchema

For example, a minimum size for a file? Or size of an array? How about restrictions on the values of a Record? Or on the structure of a Directory? (Must have N files, must have subdirectory named “foo”, …)

How to encode the new validation specs?

All encodings will follow YAML v1.2. We were not sure about how concise we can be with the hierarchies/object structure in YAML so we include a short notation and a long one for every parameter type in the proposal and hope to find a consensus on one.

Examples (concise syntax proposal)

Real interval allowed values: -2 OR any number between 0 (inclusive) and 3 (exclusive) OR any number greater than or equal to 6

inputs:
  fooparam:
    type: double
arguments:
  - valueFrom: $(inputs.fooparam)
    prefix: -A
    position: 1
restrictions:
  fooparam:
    - !interval: “[0,3)” # Use YAML tags to enable parsing of custom types
      # If disambiguation according to that one field is possible,
      # the “!” can be removed and it becomes a field
    - !interval: “[6,)”
    - -2

Integer interval allowed values: one of -3, -2, 0, 1, 2 OR any integer greater than or equal to 6

inputs:
  fooparam:
    type: int
arguments:
  - valueFrom: $(inputs.fooparam)
    prefix: -A
    position: 1
restrictions:
  fooparam:
    - !intInterval: “0..2”
    - !intInterval: “6..”
    - -2 # this would be an integer scalar
    - -3 # another

Integer interval plus single values array allowed values: one of -3, -2, 0, 1, 2 OR any integer greater than or equal to 6

inputs:
  fooparam:
    type: int
arguments:
  - valueFrom: $(inputs.fooparam)
    prefix: -A
    position: 1
restrictions:
  fooparam:
    - !intInterval: “0..2”
    - !intInterval: “6..”
    - [-2, -3]

Mixed interval allowed values: any real number between 0 and 1 (exclusive) OR 0 OR any non-negative integer

inputs:
  fooparam:
    type: double
arguments:
  - valueFrom: $(inputs.fooparam)
    prefix: -A
    position: 1
restrictions:
  fooparam:
    - !interval: “(0,1)”
    - !intInterval: “0..”

Integer sequence allowed values: all odd integers

inputs:
  fooparam:
    type: double
arguments:
  - valueFrom: $(inputs.fooparam)
    prefix: -A
    position: 1
restrictions:
  fooparam:
    - !intSequence: “https://oeis.org/A005408”

RegEx

inputs:
  fooparam:
    type: string
arguments:
  - valueFrom: $(inputs.fooparam)
    prefix: -A
    position: 1
restrictions:
  fooparam:
    - foo
    - bar
    - !regex: “.*baz.+\.xml”

RegEx plus single values array

inputs:
  fooparam:
    type: string
arguments:
  - valueFrom: $(inputs.fooparam)
    prefix: -A
    position: 1
restrictions:
  fooparam:
    - [foo, bar]
    - !regex: “.*baz.+\.xml”

Examples (explicit syntax proposal)

Real interval

inputs:
  fooparam:
    type: double
arguments:
  - valueFrom: $(inputs.fooparam)
    prefix: -A
    position: 1
restrictions:
  fooparam:
    - class: interval       
      #class optional, disambiguation given fields should be possible
      low: 0.0          #default is .-inf 
      low_inclusive: true       #default is true
      high: 3.0         #default is .inf
      high_inclusive: false #default is true
    - class: interval
      low: 6.0
      low_inclusive: true
      high: .inf            # positive infinity in YAML syntax
      # .inf implies inclusive=false, else error
      high_inclusive: false
    - class: interval
      low: -2.0
      low_inclusive: true
      high: -2.0   
      high_inclusive: true
      # if any of the bounds are exclusive -> low and high must be different

Integer interval

inputs:
  fooparam:
    type: int
arguments:
  - valueFrom: $(inputs.fooparam)
    prefix: -A
    position: 1
restrictions:
  fooparam:
    - class: intInterval
      low: 0
      high: 2
    - class: intInterval
      low: 6.0
      high: .inf            # positive infinity in YAML syntax
    - class: intInterval
      low: -2.0             #default is .-inf  
      high: -2.0            #default is .inf

Mixed interval

inputs:
  fooparam:
    type: double
arguments:
  - valueFrom: $(inputs.fooparam)
    prefix: -A
    position: 1
restrictions:
  fooparam:
    - class: interval
      low: 0.0
      low_inclusive: false
      high: 1.0
      high_inclusive: false
    - class: intInterval
      low: 0
      high: .inf                 # positive infinity in YAML syntax

Integer sequence allowed values: all odd integers

inputs:
  fooparam:
    type: double
arguments:
  - valueFrom: $(inputs.fooparam)
    prefix: -A
    position: 1
restrictions:
  fooparam:
    - class: intSequence
      oeisId: “https://oeis.org/A005408”

RegEx

inputs:
  fooparam:
    type: string
arguments:
  - valueFrom: $(inputs.fooparam)
    prefix: -A
    position: 1
restrictions:
  fooparam:
    - class: regex
      pattern: “foo”
    - class: regex
      pattern: “bar”
    - class: regex
      Pattern: “.*baz.+\.xml”

For the concise version we would need to work with something like YAML tags and arrays of mixed types. Is this possible? Sets instead of arrays would of course also be possible, if not preferred. Replace “-” with “?” then.

Where to put the specs?

We suggest to put them in http://www.commonwl.org/v1.0/CommandLineTool.html#CommandLineTool as a new top level section “restrictions” with a specification similar to the following:

restrictions map<input_name: array<string | int | float | double | range<int | float | double> | regex | Expression > > False (Only applicable for string, int, long, float and double parameters) List of restrictions that apply to Parameter input_name. A given parameter value should be accepted if any of the restrictions match the input. Restrictions can be single values (appl. to all param. types), ranges of the same type (appl. to int, float and double), regular expressions (for string parameters) or Expressions (if an expression is provided, the expression must return a boolean representing a match). If the parameter is an array type, every value has to match at least one of the restrictions. We hope for some input on the “only applicable for” dilemma. Single values could be omitted in favor of intervals with same boundaries or regexes without “wildcards” in exchange for more verbose files. See e.g. our explicit examples and the open questions section below.

Trial period

There’s a couple of ways restrictions could be trialed before going into a CWL spec, either using a new ProcessRequirement to hold the “restrictions” field, or by using a namespaced extension field.

As a requirement:

inputs:
  fooparam:
    type: string
hints:
  ParameterRestrictions:
    restrictions:
      fooparam: …

As an extension field (note: this makes it possible to specify restrictions to array elements and record fields):

inputs:
  fooparam:
    type: string
    rst:restrictions:
      fooparam: …

Open questions

How to name the concept, new top-level section etc? In this document we have used “restrictions”, but “constraints”, “validations” and “allowed values” have also been proposed. Should the restrictions be a part of the type definition, or in a separate section? Should the concise or explicit be used? Or perhaps a mixed form, e.g. with several fields per element (better for extension) but shorthand notation for intervals etc.? Must YAML tags be used to specify restriction classes (as in the concise notation examples) or can the types be deduced? Should sets of allowed numbers and words be represented with the interval, intInterval and regex classes as in the examples (clearer specs but much more verbose) or should there be an extra class (perhaps using CWL array notation)? Should string restrictions replace the Enum type (at least for this usage scenario)? Currently, for strings you can set the list of allowed values using Enums. See http://www.commonwl.org/user_guide/19-custom-types/ and http://www.commonwl.org/v1.0/CommandLineTool.html#CommandInputEnumSchema Should CWL-supported JavaScript Expressions be allowed/supported? Expressions of e.g. the form $(5 + x % 2 == 0) where x represents the input value and the expression returns a boolean could be very useful for automatic validation as well as representing sequences (see below). Note, the use of Expressions to implement this concept is already possible using valueFrom to test, returning self if the test passes, or throwing a Javascript Exception if it does not. Should integer sequences be supported (with extra optional validation)? One possibility is to specify an identifier of the Online Encyclopedia of Integer Sequences (https://oeis.org/). Additionally, CWL-supported JavaScript Expressions might be used both for representing and automatic validation of integer sequences. Which regular expression implementation should be used? So far the following have been proposed: POSIX-compliant Extended Regular Expressions (POSIX, but not implemented in most languages) and PCRE (used by built-in regex engines of Python, Ruby, PHP, JavaScript and others). We should stay with only one flavor now and choose one that is easily accessible in a range of possible validation engines/languages. Should we support the negation individual (and/or the whole set of) restrictions? E.g. a regex ’.keyword.’ would allow the user to supply any string except for any that contain the substring “keyword”. Should we then also support stringing together individual restrictions in an ‘AND’ rather than in an ‘OR’ manner? Backwards propagation of restrictions to the Workflow level is desirable for better user experience. But how to deal with Workflow level inputs that are connected to two or more inputs (of the same tool or different tools) with different restrictions? Where to implement validations in the CWL reference implementation? Probably somewhere in schema_salad/validate.py

== Peter == Some notes:

This should be part of schema salad. The CWL parameter definition inherits from base schema salad types, and schema salad code is responsible for document validation, including checking input objects. So the implementation would go into schema_salad/validate.py

Note that as long as the set of types restrictions is bounded, we don’t need to specify “class”, the validator can disambiguate just based on which fields are provided. Integer range restrictions (minimum/maximum/step size) should be usable on array types, and apply to the length of the array. The syntax should also make it possible to apply restrictions on the items of the array as well:

Applies to items:

inputs:
  fooparam:
    type: 
      type: array
      items: string
      rst:restrictions:
        fooparam: …

Applies to array size:

inputs:
  fooparam:
    type: 
      type: array
      items: string
    rst:restrictions:
      fooparam: …

If you haven’t already, consult XML-Schema and JSON-Schema for ideas about what other kinds of constraints might be useful. I suggest not having Javascript expressions as part of the restrictions spec. However, it might make sense to add a “validator” field to the CWL Process object (the supertype of CommandLineTool/Workflow) which allows the use of an arbitrary Javascript expression to validate the full input object. This would provide a way to check for complex dependencies between inputs which can’t be easily expressed through simple value restrictions.

mr-c commented 6 years ago

Originally draft: https://docs.google.com/document/d/1-F8CtU7Em0hwnhVjIsdHtVGqpzhKm3dpuXOCpKy9SRQ/edit

ghost commented 5 years ago

@bogdang989

mr-c commented 4 years ago

For those who want this feature now, here is a sketch of a workaround: https://github.com/common-workflow-language/common-workflow-language/issues/907#issuecomment-653743239

fmigneault commented 4 years ago

Personally, I would much rather have explicit fields as proposed here over the "hackish" inline code of https://github.com/common-workflow-language/common-workflow-language/issues/907#issuecomment-653743239. So +1 for this kind of feature.

My favorite approach would be closer to the "explicit syntax", but I feel restrictions should really be nested under the respective (e.g.: same level as type) since these restrictions apply for the relevant input. I don't see the advantage of having a separate restrictions section, as one would need to go back-and-forth between it and the inputs section. It is error prone if input name is modified, and forgotten in the other section, so better to just have them together in the first place.

multimeric commented 3 years ago

Some quick thoughts:

bogdang989 commented 3 years ago

To chime in with our previous thinking (from a few years ago I think) if it is useful, Seven Bridges considered this as an internal hint for start as well, but never got to implementing it yet. The main value is to prevent executions that are destined to fail of course, thus saving time, money and most importantly nerves of the person running the workflow :) I'll share some use-cases we listed that are captured from users and would provide value today in real workflows:

Examples
- Input A value must be less than Input B value
- If Input A or Input B is set then both A and B must be set
- Parameters A and B cannot be both set
- If input file A is set then parameter B must be set
- Input list A must contain exactly 4 elements
- Input integer A must be in range 0-255
- Input file A does not have sample_id metadata field
- Input file A must have extension .bam
- Input file A too big, increase disk space/change instance type/modify resource requirements
- Input file A very small, reduce disk space/use smaller instance/modify resource requirements
- Input string A does not match an expected RegEX
- Return any custom object for an error (or simply return a string for error). What to return for no error (null, false, something else?)?

and this is the mock syntax that was considered and how all the use cases listed above could be covered.

hints:
- class: sbg:validationExpression
  errorConditions:
  - ${ if (inputs.min_value > inputs.max_value) return "min_value must be <= max_value"}
  - ${ if (inputs.reads.length != 2) return "Provide exactly 2 files for reads"}
  - ${ if (inputs.bam.nameext != '.bam') return "Please insert BAM file on BAM input"}
  - ${ if ( (inputs.min_value && !inputs.max_value) || (!inputs.min_value && inputs.max_value) ) return "Inputs min_value and max_value must be both set!"}
  - ${ if (inputs.a && inputs.b) return "You can't set both a and b. Please choose one!"}
  - ${ if (!inputs.a.metadata.sample_id) return "Set metadata field sample_id for input a."}
  - ${ if (inputs.a.size > 1000000000) return "Input a too big, increase disk."}
  - ${ if (inputs.a.size > 1000000000) return {"severity": "warning", "message": "Input a too big, increase disk."}}
  - ${ if (!/[a-d]/gi.test(inputs.a)) return "Input a does not match regex /[a-d]/gi."}

To provide reasoning behind decisions made for this approach:

multimeric commented 3 years ago

JavaScript validations are certainly an interesting and powerful angle to take. My only concern is that we probably want to make CWL as declarative as possible, which pushes us towards a data schema instead.

ghost commented 3 years ago

@TMiguelT I have an opinion re: putting in a mini-DSL in our DSL to take care of input validations vs leaving that to JS.

While we may succeed in developing a DSL that covers some of the use cases, some will always be out of reach. As we implement more and more use cases we will, to bring up an old adage, end up with a poorly specified, ad hoc implementation of LISP.

Since we already allow JS, I would rather that each input is allowed an optional validator JS expression which has access to the whole input object and returns a boolean True or False that indicates a valid or invalid input.

This is a simpler addition to the spec, reuses a component workflow engines are already implementing and adds a very powerful and much asked for functionality to the language.

Thanks.

fmigneault commented 3 years ago

I have another reason why declarative CWL schemas are better than JS. Not all users of CWL are developers, which is prone to errors and had yet another learning curve defining good processes. We want to make definition of processes as easy as possible, and when we add JS in the mix, the easiest approach for people that just want it to work is to simply omit this kind of validation as it is too complicated and troublesome to define. On the other hand, a field such as minValue or a list of choices are very natural to interpret and easy to define. The user doesn't need to worry about correct resolution and correct use of < vs <= everywhere.

tetron commented 3 years ago

Introducing a new mini-DSL (the "concise syntax" proposal) adds parsing complexity but also lacks expressiveness, so I don't think we want to go that route.

I started toying with a version of the "explicit syntax" proposal:

https://github.com/common-workflow-language/schema_salad/compare/constraints

The benefit of a declarative syntax is that it is easier to introspect and translate to other contexts, for example rendering a form for an input value, you can use min/max values to render a slider.

I was looking at JSON Schema validation as suggested by @TMiguelT and it has quite a few options. We could copy the capabilities wholesale. One thing I like is their idea of string "formats" where you can semantically express that a field matching a certain regex is a date, or an email address, etc.

This might also be an opportunity to fix up how type definitions work so that it is cleaner and more consistent to define a common type to use across a set of Workflows.

However I agree that Javascript is way more powerful, and the CWL philosophy has always been to prefer declarative structures but include Javascript as a "trap door" so that more complicated things are possible.

What if we split the difference, and did something where schema salad gets declarative constraints on individual fields, and CWL gets a expression hook for Process-level validation in Javascript? Then you can still check for things that are not possible with declarative constraints, as well as check dependencies between fields that may require special logic.

mr-c commented 3 years ago

For those who need this today, one could add an CWL expression to a unused in entry at the workflow step level via valueFrom that returned nothing unless there was a validation error, then by throwing an exception it would cause the step to fail.

ghost commented 3 years ago

@mr-c 😄 a colleague of ours uses this exact thing. I think he has a validator step that fails. This still means some processing and staging of files (which can be expensive) has to be done. I don't think there is any way to get around staging files since a validation may involve checking file sizes etc.

tetron commented 3 years ago

Or your workflow could include an expression tool that checks values and either passes through or throws an error. Same idea.

So we should be clear on how the proposed feature will be an improvement over the existing workaround. Because we could also just make the workaround official and tell people to use it.

multimeric commented 3 years ago

I wouldn't have any issue with supporting both JSON Schema and a JS validation expression. Does this require voting on by the Leadership Team?

tetron commented 3 years ago

Eventually, but we need a proposal and an implementation and a PR against the next version of the standard first.

multimeric commented 3 years ago

So what is the mechanism for deciding which implementation to pursue, before we sink time into making a proposal that might be rejected?

tetron commented 3 years ago

Well, the purpose of having an experimental implementation is to be able to evaluate it and make changes. You have to start somewhere.

mr-c commented 3 years ago

So what is the mechanism for deciding which implementation to pursue, before we sink time into making a proposal that might be rejected?

This issue is that mechanism in action. If one got a lot of pushback for a particular proposal, one might not put in the effort to make a minimal implementation.

I personally favor explicit constructs as they aid in generating better error messages and (G)UI interfaces. In fact I have been hired on a part time basis by https://www.denbi.de/network/center-for-integrative-bioinformatics-cibi and one of my tasks is to implement the features of CTD that are not yet in CWL for evaluation as future parts of the CWL standards. And this proposal is part of that.

fmigneault commented 3 years ago

I agree with @tetron about the "explicit syntax" being better than "concise syntax". Since CWL brings an abstraction of how to validate and run the process, there shouldn't be any extra parsing of the fields to interpret them. Having post-processing of the fields leads to more divergences between implementations that "support CWL". This goes along with the parsing of JS expressions which I find an horrible language dependency where none is needed for JSON/YAML-like definitions.

Also, I don't see the benefice of having the developer define specific "allowed-values" requirements using mathematical representation (eg: [0, 10[ to represent min: 0 & min_inclusive: True + max: 10 & max_inclusive: False), to then have CWL do the inverse parsing of that representation. Having both explicit/concise representations would also causes ambiguities, such as how to interpret: [0, 10[ & max_inclusive: True (which one takes priority?).