notestaff opened this issue 6 years ago
Outputs and inputs in CWL can also be marked `streamable: true`, and could therefore also take advantage of this functionality, if it were added.
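For illustration, a minimal CWL sketch of such markings (the tool and file names are made up; `gzip` just stands in for any tool that reads and writes sequentially):

```yaml
# Sketch only: a CommandLineTool whose input and output are both marked
# streamable, i.e. read/written sequentially without seeking.
cwlVersion: v1.1
class: CommandLineTool
baseCommand: [gzip, -c]
stdout: compressed.gz
inputs:
  uncompressed:
    type: File
    streamable: true          # may be provided as a pipe/stream
    inputBinding:
      position: 1
outputs:
  compressed:
    type: File
    streamable: true          # written sequentially; safe to consume as a stream
    outputBinding:
      glob: compressed.gz
```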
I'm late to the party on this, but:
> Then chains of tasks could effectively become one task
I don't think merging of tasks works if you have certain resource or software dependencies, e.g. inside a Docker container. From a software-engineering POV, is it easy / possible to detect and facilitate streaming between tasks like this, especially if they're scheduled as completely separate jobs? To me it sounds super difficult: you'd end up with a "fuzzy" dependency graph, and you could end up streaming your result data between nodes or tasks (and it's even worse if you're running on the cloud). (@mr-c, you've talked about this a few times.)
> parallel, rather than sequentially
Mostly, but what happens if two of the inputs are technically streamable, or, even more complicated, how would stdin fit into this? The CWL documentation says that it requires the path (e.g. `$(inputs.stdinRef.path)`), which to me sounds like it isn't exactly streamable, but stdout implicitly is? WDL in the version 1.0 spec doesn't include any reference to 'stream', so I'm surprised to see DNAnexus adding a separate tagging mechanism for this optimisation.
Late edit: reformatting for clarity
But piping (named and anonymous) is super easy in WDL because you have a command line, and in CWL your best bet is using ShellCommandRequirement + `shellQuote: false`. The only catch is that your Docker image must have a superset of the software and hardware requirements.
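As a sketch of that "merge into one command line" approach in WDL (tool names and the Docker image are placeholders; the single container has to provide both tools, per the catch above):

```wdl
version 1.0

# Sketch: two logical steps collapsed into one task via an anonymous pipe.
task align_and_sort {
  input {
    File reads
    File ref
  }
  command <<<
    # producer | consumer on one command line; no intermediate file is written
    bwa mem ~{ref} ~{reads} | samtools sort -o sorted.bam -
  >>>
  output {
    File sorted_bam = "sorted.bam"
  }
  runtime {
    docker: "example.org/bwa-plus-samtools:1.0"  # placeholder image containing both tools
  }
}
```

The CWL analogue would wrap the same pipeline in a single CommandLineTool with ShellCommandRequirement and the `|` passed as an argument with `shellQuote: false`.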
In addition to the links spread out around this post, there's a bit of discussion:
Edit: Quotes + Grammar
Cromwell itself doesn't implement streaming for tasks in either WDL or CWL, mostly. The one exception, depending on how broadly one wants to define streaming, is that Cromwell can understand a WDL "hint" (in quotes as there's not an official concept) that a cloud-native path in a File variable can be handed directly to the task instead of being converted to a local file on the POSIX filesystem. This uses the same sort of markup that DNAnexus is using in the task metadata of the WDL.
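If it's the hint I'm thinking of, the markup looks roughly like Cromwell's `localization_optional` parameter_meta entry. A sketch (the tool name is hypothetical, it would need to understand cloud paths such as gs:// itself, and backend support varies):

```wdl
version 1.0

# Sketch of the hint described above: the engine may hand the cloud-native
# path straight to the command instead of localizing the file first.
task summarize {
  input {
    File data
  }
  parameter_meta {
    data: {
      localization_optional: true
    }
  }
  command <<<
    # my_tool is hypothetical and must be able to read the remote path itself
    my_tool --input ~{data} > summary.txt
  >>>
  output {
    File summary = "summary.txt"
  }
}
```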
In CWL v1.1 we clarify that `stdin` is also streamable.
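For example (a minimal sketch; the tool and parameter names are made up), an input piped to the tool's stdin can itself be marked streamable:

```yaml
# Sketch: the tool reads its File input on stdin; with CWL v1.1 semantics
# the input can be marked streamable and consumed as a stream.
cwlVersion: v1.1
class: CommandLineTool
baseCommand: [wc, -l]
stdin: $(inputs.text.path)
inputs:
  text:
    type: File
    streamable: true
outputs:
  line_count:
    type: stdout
```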
> but what happens if two of the inputs are technically streamable,
One can use named pipes to implement this (likewise for multiple streamable outputs).
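A sketch of what that could look like if expressed inside a single merged task (`produce_signal` and `compare_signals` are hypothetical tools; an engine doing this across separately scheduled tasks would have to do the equivalent wiring itself):

```wdl
version 1.0

# Sketch: two streamable inputs fed to one consumer through named pipes,
# so neither intermediate result has to hit disk.
task merged_compare {
  input {
    File sample_a
    File sample_b
  }
  command <<<
    mkfifo a.fifo b.fifo
    produce_signal ~{sample_a} > a.fifo &        # first producer, in the background
    produce_signal ~{sample_b} > b.fifo &        # second producer, in the background
    compare_signals a.fifo b.fifo > result.txt   # consumer reads both pipes as files
    wait                                         # reap the background producers
  >>>
  output {
    File result = "result.txt"
  }
}
```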
Engine support:
For streaming to/from the data storage system, the Arvados Keep data system means that the Arvados Crunch workflow manager doesn't have to wait for input files to be staged (copied) in. The Arvados Keep FUSE plugin only downloads data as the tool requests access to a particular offset. I don't think they co-schedule tasks (either on the same system or on "nearby" nodes) for direct streaming yet.
If the output of task A is read by task B, and both tasks support streaming, then they can be run in parallel, rather than sequentially, using a named pipe to stream A's output to B's input (or a socket if run on different machines). Streamable file params would be designated via param_metadata, as in dxWDL: https://github.com/dnanexus/dxWDL/blob/master/doc/ExpertOptions.md#extensions

Then chains of tasks could effectively become one task, reducing the overall workflow runtime. This could also be implemented by a standalone WDL-to-WDL rewriter, which would identify chains of mergeable tasks in a workflow and auto-generate a new task that runs the original tasks in parallel.
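A sketch of the proposal (tool names are illustrative; the `"stream"` value follows dxWDL's ExpertOptions, and an engine would still have to understand the hint in order to co-schedule the calls and wire up a pipe or socket):

```wdl
version 1.0

# Task A: producer.
task make_calls {
  input {
    File bam
  }
  command <<<
    caller ~{bam} > calls.vcf
  >>>
  output {
    File calls = "calls.vcf"
  }
}

# Task B: consumer; declares that it reads `calls` sequentially, so the
# engine could substitute a named pipe (or socket) for a real file.
task annotate_calls {
  input {
    File calls
  }
  parameter_meta {
    calls: "stream"
  }
  command <<<
    annotator ~{calls} > annotated.vcf
  >>>
  output {
    File annotated = "annotated.vcf"
  }
}

workflow stream_demo {
  input {
    File bam
  }
  call make_calls { input: bam = bam }
  call annotate_calls { input: calls = make_calls.calls }
  output {
    File annotated = annotate_calls.annotated
  }
}
```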