Open mlin opened 5 years ago
Refinement of the above concept: if the file download cache is enabled, exclude from the upfront download pass any input URI consumed by at most a single node (outside a scatter), with no other nodes consuming the File.
wip: https://github.com/chanzuckerberg/miniwdl/pull/417/commits/d3fd41b49f6b58ebb25ab3634767d4d5641823e3 but it's more complex than hoped for
Our first-pass logic for workflow file downloading does one pass to download all URIs referenced in the inputs, before proceeding with any other workflow steps. It would be nice to perform the downloads lazily so that any steps not dependent on them can run concurrently. (Remember that 'steps' includes not only task calls, but also evaluation of WDL expressions at the workflow level; e.g. a workflow can `read_lines()` a `File` which might be input as a URI.)

Possible 80/20 optimization: exclude from the upfront download pass any File inputs which are only fed directly into a call input, at most once (& not in a scatter). Then the task runner itself can perform the download. ("at most once" to avoid repeatedly downloading a file used multiple times)
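
To make the 80/20 rule concrete, here is a rough sketch of the selection step only. It does not use miniwdl's actual data structures: the `Use` record and `deferrable_inputs` function are hypothetical names invented for illustration, and it assumes some earlier pass has already listed every place a workflow-level File input is referenced.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, Set


@dataclass(frozen=True)
class Use:
    """One reference to a workflow-level File input (hypothetical model, not miniwdl's)."""
    input_name: str
    direct_call_input: bool  # True if passed as-is to a call input; False for any other expression
    in_scatter: bool         # True if the reference sits inside a scatter section


def deferrable_inputs(uri_inputs: Iterable[str], uses: Iterable[Use]) -> Set[str]:
    """Return the URI-valued inputs that could skip the upfront download pass.

    An input qualifies when it is referenced at most once, and that reference
    (if any) feeds a call input directly and is not inside a scatter.
    """
    uses = list(uses)
    counts = Counter(u.input_name for u in uses)
    # Any workflow-level expression use or scatter use forces an upfront download.
    blocked = {u.input_name for u in uses if not u.direct_call_input or u.in_scatter}
    return {name for name in uri_inputs if counts[name] <= 1 and name not in blocked}


example_uses = [
    Use("reads", direct_call_input=True, in_scatter=False),          # single direct use
    Use("ref", direct_call_input=True, in_scatter=False),            # used by two calls
    Use("ref", direct_call_input=True, in_scatter=False),
    Use("sample_sheet", direct_call_input=False, in_scatter=False),  # e.g. read_lines() at workflow level
    Use("bam", direct_call_input=True, in_scatter=True),             # inside a scatter
]
deferred = deferrable_inputs(["reads", "ref", "sample_sheet", "bam", "unused"], example_uses)
```

In this toy example only `reads` and `unused` are deferred. `unused` qualifies because "at most once" includes zero references, so it would never be downloaded at all.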
The full enchilada would introduce download operations as workflow graph nodes. The state machine would be liable to issue a download Call whilst visiting a File Decl node. Probably overkill?
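
For comparison, a toy illustration of the "full enchilada": treat each download as its own node in the dependency graph, so steps that don't need the file become ready straight away. This is a sketch using Python's standard `graphlib` (3.9+), not miniwdl's state machine. The node naming scheme and the helpers `add_download_nodes` and `ready_waves` are assumptions made up for the example.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def add_download_nodes(deps: Dict[str, Set[str]], uri_decls: Set[str]) -> Dict[str, Set[str]]:
    """Return a copy of deps (node -> predecessors) in which every URI-valued
    File decl gains a new download node as a predecessor."""
    out = {node: set(preds) for node, preds in deps.items()}
    for decl in uri_decls:
        download = f"download:{decl}"
        out[download] = set()
        out.setdefault(decl, set()).add(download)
    return out


def ready_waves(deps: Dict[str, Set[str]]) -> List[List[str]]:
    """Group nodes into successive waves; nodes within a wave could run concurrently."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


# Hypothetical workflow: align needs the URI input, make_index does not.
workflow = {
    "decl:reads": set(),
    "call:align": {"decl:reads"},
    "call:make_index": set(),
    "call:report": {"call:align", "call:make_index"},
}
waves = ready_waves(add_download_nodes(workflow, {"decl:reads"}))
```

Here the first wave contains both `call:make_index` and `download:decl:reads`, i.e. the index step starts alongside the download instead of waiting for an upfront pass to finish.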