common-workflow-language / common-workflow-language

Repository for the CWL standards. Use https://cwl.discourse.group/ for support 😊
https://www.commonwl.org
Apache License 2.0

proposal for "splittable" feature #446

Open mr-c opened 7 years ago

mr-c commented 7 years ago

Scenario: A tool can work on a subset of its regular input, but the input is not already available as multiple files (so scatter on its own would not help). One can manually add steps that split the input before the tool in question, scatter over the split files, and combine the outputs afterwards, but that pollutes the workflow with optimization details.

Executors ignorant of this feature will process the input normally, though more slowly. Splittable-aware executors can optimize the split "out of band" or perhaps through autotuning. They can also re-use previously cached results even if the splitting threshold was different (as long as all other aspects are identical).

Proposed syntax via example: InterProScan.cwl:

cwlVersion: v1.0
# this feature is backwards compatible, can be implemented via Polyfill for vanilla v1.0
class: CommandLineTool # or any other type of Process (Workflow, ExpressionTool)

hints:
  Splittable:
    proteins:  # the name of the input that is splittable, repeat for each splittable input
      splitter: fasta_chunker.cwl  # relative path or public HTTP[S] IRI
      combiner:  concatenate.cwl  # relative path or public HTTP[S] IRI
      default: 10000  # default maximum units (records, fields, characters, etc.) per shard

inputs:
  proteins: File  # most likely a file, but could be a complex data type too

outputs:
  any_name_but_only_one: File  # only single output for now
# …

Example splitter:

cwlVersion: v1.0
class: CommandLineTool  # or any other type of Process (Workflow, ExpressionTool)

inputs:
  input: File  # always named input
  threshold: int  # always named threshold

outputs:
  shards: File[] # always of type T[] where T is the type of the input, always named shards
# …

Example combiner:

cwlVersion: v1.0
class: CommandLineTool  # or any other type of Process (Workflow, ExpressionTool)

inputs:
  shards: File[] # always of type T[] where T is the type of the output, always named shards

outputs:
  combined: File  # always named combined
# …

Proposed specification for overrides in the user input object (other out-of-band configuration would also be acceptable):

# Input object for running the tool directly
cwl:tool: InterProScan.cwl  # optional
cwl:split_overrides:
  proteins: 5000  # overriding the default threshold

# Separate input object for a workflow that uses the tool as a step
cwl:tool: a_big_workflow.cwl
cwl:split_overrides:
  functional_analysis:   # this step in a_big_workflow.cwl runs InterProScan.cwl
    proteins:
      splitter: my_better_splitter.cwl  # overriding the splitter

Combining must be done in the same order as splitting (for now).
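
For comparison, the manual expansion that this hint is meant to hide (and that a polyfill for plain v1.0 could generate) might look roughly like the sketch below. The step names `split`, `analyze` and `combine` and the workflow-level `threshold` input are made up for illustration; the tool file names and port names come from the examples above, and any InterProScan inputs elided there are left out here too. Because scatter returns its results in the order of the input array, the combiner receives the shards in splitting order.

```yaml
cwlVersion: v1.0
class: Workflow

requirements:
  ScatterFeatureRequirement: {}

inputs:
  proteins: File
  threshold: int  # hypothetical workflow-level knob, passed to the splitter

outputs:
  combined:
    type: File
    outputSource: combine/combined

steps:
  split:  # hypothetical step name
    run: fasta_chunker.cwl
    in:
      input: proteins
      threshold: threshold
    out: [shards]

  analyze:  # hypothetical step name; runs once per shard
    run: InterProScan.cwl
    scatter: proteins
    in:
      proteins: split/shards
    out: [any_name_but_only_one]

  combine:  # hypothetical step name
    run: concatenate.cwl
    in:
      shards: analyze/any_name_but_only_one
    out: [combined]
```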

Questions for all:

mr-c commented 7 years ago

[added an introduction in response to a useful question from @psafont, made explicit the order of combining in response to a query from @stain]

stain commented 7 years ago

It's a nice proposal, particularly as the splitter and combiner can be arbitrary CWL tools (with certain fixed inputs/outputs).

I think we can support multiple outputs by having Combinable as a separate section.

I don't think it makes sense to have default: 10000 per port, as the number needs to be the same (or some kind of multiple) across all split inputs. Also, default is a confusing name.

hints:
  Splittable:
    split:
      # the name of the input that is splittable, repeat for each splittable input
      proteins:  fasta_chunker.cwl 
    at: 10000
  Combinable:
    combine:
      # the name of the output to combine, repeat for each combinable output
      alignments:  concatenate.cwl
      reports:  concatenate.cwl
    order: any

Here the ports are listed under split, with the other Splittable options alongside. default is renamed to at; the threshold is then the same for all inputs.

I'm not suggesting we add order: any now (i.e. allowing combination in any order), but also putting Combinable's port listing one level down allows such options to be added later.

SolomonShorser-OICR commented 7 years ago

This could definitely simplify some of the things I've been working on!

Although I have a question about this:

  default: 10000  # default maximum units (records, fields, characters, etc.) per shard

What exactly does this mean? The way it reads (to me), it seems that if the input size is 50000, there will be 5 shards, each processing an input of size 10000. Would it make more sense to have a field to set a maximum number of shards? If these shards are implemented as parallel threads, I usually have a better idea of how many threads should be running on a given system than of how big the input to each thread should be (especially when I don't know how big the input will be in advance, but I do know that the hardware can only support a maximum of n operations of a specific type in parallel). Either way, I think the name default is not all that clear. Maybe chunkSize for input size per shard, or shardCount for number of shards?

Also, why is it a "hint"? Maybe hints work differently in CWL (I haven't worked with them at all yet, so I have no real experience), but in other languages I've worked with, anything that's called a "hint" is something you can request, but the runtime may choose to ignore.

mr-c commented 7 years ago

Howdy @SolomonShorser-OICR

Here's an example of what we're trying to replace:

https://github.com/ProteinsWebTeam/ebi-metagenomics-cwl/blob/master/workflows/functional_analysis.cwl#L37

Splitter: https://github.com/ProteinsWebTeam/ebi-metagenomics-cwl/blob/master/tools/fasta_chunker.cwl

Combiner: https://github.com/ProteinsWebTeam/ebi-metagenomics-cwl/blob/master/tools/concatenate.cwl

The main reason to split an input (a File in this case) is to enable running on multiple machines, though you are right that same-machine parallelism can sometimes make sense too.

In some sense, the default field (or at field, following @stain's suggestion) could be named number instead. It is a numerical knob that makes sense to the splitter and that has been preset by the author (but is still adjustable).

So you could swap out fasta_chunker.cwl for your own fasta_even_splitter.cwl and set the number/default/at to $(runtime.cores).
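
As a rough sketch of that swap, using the syntax from the original proposal: `fasta_even_splitter.cwl` is the hypothetical splitter named above, which would interpret the knob as a shard count rather than as records per shard. Whether a parameter reference such as `$(runtime.cores)` would actually be evaluated inside a hint is an assumption here, not something this thread settles.

```yaml
hints:
  Splittable:
    proteins:
      splitter: fasta_even_splitter.cwl  # hypothetical: splits into N roughly equal shards
      combiner: concatenate.cwl
      default: $(runtime.cores)  # assumption: the executor evaluates this reference
```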

The reason it is a hint is to maintain backwards compatibility with plain CWL v1.0 systems -- ignoring this won't break a workflow.

SolomonShorser-OICR commented 7 years ago

Ok, I think I see. I mostly find myself splitting across arrays of things (usually files, sometimes complex records), rather than lines of files.

I guess the value of the numeric knob is to be interpreted by the splitter/combiner that is being called?

mr-c commented 7 years ago

@SolomonShorser-OICR This could also be used to split arrays into subarrays; would that be useful to you?

The value of the numeric knob is specific to the splitter (only), yes.
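
A minimal sketch of such an array splitter, following the proposed splitter interface (`input`, `threshold`, `shards`), could be an ExpressionTool like the one below. It is illustrative only: the file name `array_chunker.cwl` is made up, and it assumes the knob means "maximum elements per subarray". Here T is File[], so the output type is an array of File arrays.

```yaml
# array_chunker.cwl (hypothetical file name)
cwlVersion: v1.0
class: ExpressionTool

requirements:
  InlineJavascriptRequirement: {}

inputs:
  input: File[]   # always named input
  threshold: int  # always named threshold; here: maximum elements per subarray

outputs:
  shards:         # always named shards; type is T[] with T = File[]
    type:
      type: array
      items:
        type: array
        items: File

expression: |
  ${
    // Guard against zero or negative thresholds, which would never advance the loop.
    var size = Math.max(1, inputs.threshold);
    var shards = [];
    // Slice the input into consecutive subarrays, preserving the original order.
    for (var i = 0; i < inputs.input.length; i += size) {
      shards.push(inputs.input.slice(i, i + size));
    }
    return {"shards": shards};
  }
```

The matching combiner would take an array of per-shard outputs and flatten or merge it in the same order.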

SolomonShorser-OICR commented 7 years ago

@mr-c Yes, splitting that way might be useful to me.