broadinstitute / cromwell

Scientific workflow engine designed for simplicity & scalability. Trivially transition from one-off use cases to massive-scale production environments
http://cromwell.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

streaming between tasks #3454

Open notestaff opened 6 years ago

notestaff commented 6 years ago

If the output of task A is read by task B, and both tasks support streaming, then they can run in parallel rather than sequentially, using a named pipe to stream A's output to B's input (or a socket if they run on different machines). Streamable file params would be designated via param_metadata, as in dxWDL: https://github.com/dnanexus/dxWDL/blob/master/doc/ExpertOptions.md#extensions

Chains of tasks could then effectively become one task, reducing the overall workflow runtime. This could also be implemented by a standalone WDL-to-WDL rewriter, which would identify chains of mergeable tasks in a workflow and auto-generate a new task that runs the original tasks in parallel.
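A minimal shell sketch of the merged-task idea: `task_a` and `task_b` are placeholders for the original tasks' commands, connected by a named pipe so both run concurrently.

```shell
#!/bin/sh
# Hypothetical sketch: stream task A's output into task B through a
# named pipe so the two run in parallel instead of sequentially.
set -e

dir=$(mktemp -d)
fifo="$dir/pipe"
mkfifo "$fifo"

task_a() { seq 1 5; }                            # stand-in producer
task_b() { awk '{ s += $1 } END { print s }'; }  # stand-in consumer

task_a > "$fifo" &          # A writes in the background
result=$(task_b < "$fifo")  # B reads concurrently as A writes
wait
rm -rf "$dir"

echo "$result"
```

A WDL-to-WDL rewriter would essentially have to generate a script of this shape inside the merged task's command block.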

mr-c commented 6 years ago

Outputs and inputs in CWL can also be marked `streamable: true` and could therefore also take advantage of this functionality, if added.
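For reference, a hypothetical sketch of how a CWL tool might advertise this (tool and file names are illustrative; `streamable` is advisory to the engine):

```yaml
# Hypothetical CWL tool: the stdin input and the file output are both
# marked streamable, hinting that they may be consumed/produced as
# streams rather than fully materialized files.
cwlVersion: v1.1
class: CommandLineTool
baseCommand: [gzip, -c]
inputs:
  raw:
    type: File
    streamable: true
stdin: $(inputs.raw.path)
outputs:
  compressed:
    type: File
    streamable: true
    outputBinding:
      glob: out.gz
stdout: out.gz
```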

illusional commented 5 years ago

I'm late to the party on this, but:

Then chains of tasks could effectively become one task

I don't think merging tasks works if they have differing resource or software dependencies, e.g. each runs inside its own Docker container. From a software-engineering POV, is it easy or even possible to detect and facilitate streaming between tasks like this, especially if they're scheduled as completely separate jobs? To me it sounds super difficult: you'd end up with a "fuzzy" dependency graph, and you could find yourself streaming your result data between nodes or tasks (even worse if you're running on the cloud). (@mr-c, you've talked about this a few times.)

parallel, rather than sequentially

Mostly, but what happens if two of the inputs are technically streamable? And, even more complicated, how would stdin fit into this? The CWL documentation says that stdin requires the path (eg: $(inputs.stdinRef.path)), which to me sounds like it isn't exactly streamable, but stdout implicitly is? The WDL 1.0 spec doesn't include any reference to 'stream', so I'm surprised to see DNAnexus adding a separate tagging mechanism for this optimisation.

Late edit: reformatting for clarity.

Engine support:

But piping (named and anonymous) is super easy in WDL because you have a command line; in CWL your best bet is ShellCommandRequirement + shellQuote: false. The only catch is that your Docker image must have a superset of the software and hardware requirements.
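As a concrete sketch of in-task piping in WDL (tool and image names are illustrative, not from any real workflow):

```wdl
# Hypothetical WDL task: because the command block is a shell script,
# an ordinary pipe already runs both tools concurrently -- but the
# single Docker image must contain both of them.
task align_and_sort {
  input {
    File reads
  }
  command <<<
    bwa mem ref.fa ~{reads} | samtools sort -o sorted.bam -
  >>>
  output {
    File sorted = "sorted.bam"
  }
  runtime {
    docker: "example/aligner-plus-samtools:latest"  # superset of both tools' deps
  }
}
```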

In addition to the links spread out around this post, there's a bit of discussion:

Edit: Quotes + Grammar

geoffjentry commented 5 years ago

For the most part, Cromwell itself doesn't implement streaming for tasks in either WDL or CWL. The one exception, depending on how broadly one wants to define streaming, is that Cromwell can understand a WDL "hint" (in quotes, as there's no official concept) that a cloud-native path in a File variable can be handed directly to the task instead of being converted to a local file on the POSIX filesystem. This uses the same sort of markup in the WDL's task metadata that DNAnexus uses.
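A sketch of what that hint looks like, assuming Cromwell's `localization_optional` key in `parameter_meta` (task name and command are illustrative):

```wdl
# Hypothetical task: the parameter_meta entry tells Cromwell it may
# skip localizing big_input and pass the cloud path (e.g. gs://...)
# straight to the command, which must then read it itself.
task count_records {
  input {
    File big_input
  }
  parameter_meta {
    big_input: { localization_optional: true }
  }
  command <<<
    gsutil cat ~{big_input} | wc -l
  >>>
  output {
    Int n = read_int(stdout())
  }
}
```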

mr-c commented 5 years ago

In CWL v1.1 we clarified that stdin is also streamable.

mr-c commented 3 years ago

but what happens if two of the inputs are technically streamable,

One can use named pipes to implement this (likewise for multiple streamable outputs).
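A small shell sketch of the two-streamable-inputs case, with placeholder producers standing in for the upstream tasks:

```shell
#!/bin/sh
# Hypothetical sketch: two upstream tasks each write to their own named
# pipe; a single consumer reads both streams concurrently.
set -e

dir=$(mktemp -d)
p1="$dir/in1"; p2="$dir/in2"
mkfifo "$p1" "$p2"

seq 1 3 > "$p1" &   # stand-in for the first upstream task
seq 4 6 > "$p2" &   # stand-in for the second upstream task

# Consumer opens both pipes; paste merges them line by line.
merged=$(paste -d' ' "$p1" "$p2")
wait
rm -rf "$dir"

echo "$merged"
```

The same shape generalizes to multiple streamable outputs: one producer writing into several named pipes, each read by a different downstream consumer.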

mr-c commented 3 years ago

Engine support:

For streaming to/from the data storage system: the Arvados Keep data system means that the Arvados Crunch workflow manager doesn't have to wait for input files to be staged (copied) in, and the Arvados Keep FUSE plugin only downloads data as the tool requests access to a particular offset. I don't think they co-schedule tasks (either on the same system or on "nearby" nodes) for direct streaming yet.