Open mr-c opened 7 years ago
[added an introduction in response to a useful question from @psafont, made explicit the order of combining in response to a query from @stain]
It's a nice proposal, particularly as the splitter and combiner can be arbitrary CWL tools (with certain fixed inputs/outputs).
I think we can support multiple outputs by having `Combinable` as a separate section.
I don't think it makes sense to have `default: 10000` per port, as the number needs to be the same (or some kind of multiple) across all split inputs. Also, `default` is confusing.
    hints:
      Splittable:
        split:
          # the name of the input that is splittable, repeat for each splittable input
          proteins: fasta_chunker.cwl
        at: 10000
      Combinable:
        combine:
          # the name of the output to combine, repeat for each combinable output
          alignments: concatenate.cwl
          reports: concatenate.cwl
        order: any
Here the ports are listed under `split`, with the other `Splittable` options neighbouring it. `default` is renamed to `at`; the threshold needs to be the same for all inputs.
I'm not suggesting to add `order: any` now (i.e. allowing combination in any order), but having `Combinable`'s port listing a level down as well allows you to add such options later.
This could definitely simplify some of the things I've been working on!
Although I have a question about this:
    default: 10000 # default maximum units (records, fields, characters, etc.) per shard
What exactly does this mean? The way it reads (to me), it seems that if the input size is 50000, then there will be 5 shards, each processing an input of size 10000. Would it make more sense to have a field to set a maximum number of shards? If these shards are implemented as parallel threads, I usually have a better idea of how many threads should be running on a given system than of how big the input to each thread should be (especially when I don't know how big the input will be in advance, but I do know that the hardware can only support a maximum of n operations of a specific type in parallel).
Either way, I think the name `default` is not all that clear. Maybe `chunkSize` for input size per shard, or `shardCount` for number of shards?
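To make the two readings concrete, here is a minimal Python sketch of how a per-shard size and a shard count relate for a known input size. The function names `shard_count_for` and `chunk_size_for` are hypothetical and not part of the proposal; the sketch only illustrates the arithmetic behind the 50000 / 10000 example above.

```python
def shard_count_for(total_units: int, chunk_size: int) -> int:
    """Number of shards when each shard holds at most chunk_size units."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    # Integer ceiling division: a partially filled last shard still counts.
    return (total_units + chunk_size - 1) // chunk_size


def chunk_size_for(total_units: int, shard_count: int) -> int:
    """Largest shard size when the input is spread over shard_count shards."""
    if shard_count <= 0:
        raise ValueError("shard_count must be positive")
    return (total_units + shard_count - 1) // shard_count


print(shard_count_for(50000, 10000))  # 5
print(chunk_size_for(50000, 8))       # 6250
```

Either knob determines the other once the input size is known; the difference is which one the author can sensibly preset before seeing the input.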
Also, why is it a "hint"? Maybe hints work differently in CWL (I haven't worked with them at all yet, so I have no real experience), but in other languages I've worked with, anything that's called a "hint" is something you can request, but the runtime may choose to ignore.
Howdy @SolomonShorser-OICR
Here's an example of what we're trying to replace
Splitter: https://github.com/ProteinsWebTeam/ebi-metagenomics-cwl/blob/master/tools/fasta_chunker.cwl
Combiner: https://github.com/ProteinsWebTeam/ebi-metagenomics-cwl/blob/master/tools/concatenate.cwl
The main reason to split an input (`File` in this case) is to enable running on multiple machines, though you are right, same-machine parallelism can sometimes make sense too.
In some sense, the `default` field (or `at` field, following @stain's suggestion) could be named `number` instead. It is some numerical knob that makes sense to the splitter and that has been preset by the author (but is still adjustable).
So you could swap out `fasta_chunker.cwl` for your own `fasta_even_splitter.cwl` and set the `number`/`default`/`at` to `$(runtime.cores)`.
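As a rough illustration of what such an even splitter might do, the Python sketch below distributes FASTA records over a fixed number of shards. It is not the contents of any real `fasta_even_splitter.cwl`; the functions `read_fasta_records` and `split_evenly` are hypothetical, and `shard_count` stands in for whatever value the executor would supply through `$(runtime.cores)`.

```python
from typing import List


def read_fasta_records(text: str) -> List[str]:
    """Split FASTA text into records; each record starts with a '>' header line."""
    records: List[str] = []
    for line in text.splitlines():
        if line.startswith(">"):
            records.append(line + "\n")
        elif records:
            # Sequence line: append to the most recent record.
            records[-1] += line + "\n"
    return records


def split_evenly(records: List[str], shard_count: int) -> List[List[str]]:
    """Spread records over at most shard_count contiguous shards.

    Shard sizes differ by at most one record, and record order is preserved.
    """
    if shard_count <= 0:
        raise ValueError("shard_count must be positive")
    base, extra = divmod(len(records), shard_count)
    shards: List[List[str]] = []
    start = 0
    for i in range(shard_count):
        size = base + (1 if i < extra else 0)
        if size == 0:
            break  # fewer records than shards: stop instead of emitting empty shards
        shards.append(records[start:start + size])
        start += size
    return shards


fasta = ">a\nAC\n>b\nGT\n>c\nTT\n"
shards = split_evenly(read_fasta_records(fasta), 2)
print([len(s) for s in shards])  # [2, 1]
```

Here the numeric knob is read as a shard count rather than a per-shard size, which is why its meaning has to be left to the splitter.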
The reason it is a hint is to maintain backwards compatibility with plain CWL `v1.0` systems -- ignoring this won't break a workflow.
Ok, I think I see. I mostly find myself splitting across arrays of things (usually files, sometimes complex records), rather than lines of files.
I guess the value of the numeric knob is to be interpreted by the splitter/combiner that is being called?
@SolomonShorser-OICR This could be used to split arrays into subarrays; would that be useful to you?
The value of the numeric knob is specific to the splitter (only), yes.
@mr-c Yes, splitting that way might be useful to me.
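For the array case, a splitter could be as simple as the following Python sketch, which cuts an array into subarrays of at most `at` items. The function `split_array` and the file names are hypothetical; the parameter name `at` is borrowed from the suggestion earlier in the thread.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_array(items: Sequence[T], at: int) -> List[List[T]]:
    """Cut items into consecutive subarrays holding at most `at` elements each."""
    if at <= 0:
        raise ValueError("at must be positive")
    return [list(items[i:i + at]) for i in range(0, len(items), at)]


files = ["a.bam", "b.bam", "c.bam", "d.bam", "e.bam"]
print(split_array(files, 2))
# [['a.bam', 'b.bam'], ['c.bam', 'd.bam'], ['e.bam']]
```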
Scenario: A tool can work on a subset of regular input, but the input is not already available as multiple files (and therefore `scatter` would not be useful).
One can manually add steps to split the input prior to the tool in question (and then `scatter` over these split files) and combine the outputs afterwards, but that pollutes your workflow with optimization details.
Executors ignorant of this feature will process normally, though more slowly. `Splittable`-aware executors can optimize the split "out of band" or perhaps through autotuning. They can also re-use previously cached results even if the splitting threshold was different (as long as all other aspects are identical).
Proposed syntax via example:
`InterProScan.cwl`:
Example splitter:
Example combiner:
Proposed specification for overrides in user input object (other out of band configuration would also be acceptable)
Combining must be done in the same order as splitting (for now).
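The ordering requirement can be sketched in Python as follows. This is only an assumption about how an aware executor might behave, not part of the proposal: `run_split` and `toy_tool` are hypothetical names, and a thread pool stands in for running shards on separate machines.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def run_split(
    lines: List[str],
    at: int,
    tool: Callable[[List[str]], List[str]],
    max_workers: int = 4,
) -> List[str]:
    """Split lines into shards of at most `at`, run tool per shard, combine in split order."""
    if at <= 0:
        raise ValueError("at must be positive")
    shards = [lines[i:i + at] for i in range(0, len(lines), at)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, regardless of which shard finishes first.
        results = list(pool.map(tool, shards))
    combined: List[str] = []
    for part in results:
        combined.extend(part)
    return combined


def toy_tool(lines: List[str]) -> List[str]:
    """Stand-in for a tool that processes each record independently."""
    return [line.upper() for line in lines]


data = ["acgt", "ttga", "ccat", "gatt", "aacc"]
# An executor that ignores the hint runs the tool once on the whole input;
# the split run gives the same output because shards are combined in split order.
assert run_split(data, 2, toy_tool) == toy_tool(data)
```

Because the combined output matches the unsplit run for any threshold, this is also the property that would let cached results be re-used when only the threshold changes.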
Questions for all:
`Splittable`, `splitter`, `combiner`, `default`, `input`, `threshold`, `shards`, `combined`, `cwl:split_overrides`
`Splittable` processes with multiple outputs?