chanzuckerberg / miniwdl

Workflow Description Language developer tools & local runner
MIT License

lazy file downloading #282

Open mlin opened 5 years ago

mlin commented 5 years ago

Our first-pass logic for workflow file downloading does one pass to download all URIs referenced in the inputs, before proceeding with any other workflow steps. It would be nice to perform the downloads lazily so that any steps not dependent on them can run concurrently. (Remember that 'steps' includes not only task calls, but also evaluation of WDL expressions at the workflow level; e.g. a workflow can read_lines() a File which might be input as a URI.)

Possible 80/20 optimization: exclude from the upfront download pass any File input that is fed directly into a call input, at most once (& not inside a scatter). The task runner itself can then perform the download. ("At most once" avoids repeatedly downloading a file that is used multiple times.)
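A rough sketch of that selection rule, for illustration only. The `Consumer` record, `is_uri`, and `partition_downloads` are hypothetical names and do not correspond to miniwdl's actual AST or runner API; the sketch assumes each workflow-level File input has already been mapped to the list of places that use it.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass(frozen=True)
class Consumer:
    """One use of a workflow-level File input (hypothetical model)."""
    node: str                 # name of the consuming workflow node
    kind: str                 # "call_input" if passed straight to a call input, else "expr"
    in_scatter: bool = False  # True if the consuming node sits inside a scatter


def is_uri(value: str) -> bool:
    # Crude assumption: anything with a scheme separator is a downloadable URI.
    return "://" in value


def partition_downloads(
    inputs: Dict[str, str],
    consumers: Dict[str, List[Consumer]],
) -> Tuple[Set[str], Set[str]]:
    """Split URI inputs into (upfront, deferred) download sets.

    An input is deferred to the task runner only if it is used at most once,
    and that use is a direct call input outside any scatter.
    """
    upfront: Set[str] = set()
    deferred: Set[str] = set()
    for name, value in inputs.items():
        if not is_uri(value):
            continue  # local paths need no download at all
        uses = consumers.get(name, [])
        direct_once = len(uses) <= 1 and all(
            u.kind == "call_input" and not u.in_scatter for u in uses
        )
        (deferred if direct_once else upfront).add(name)
    return upfront, deferred
```

Anything consumed by a workflow-level expression (such as read_lines()), used more than once, or used inside a scatter stays in the upfront pass, so existing behavior is preserved for those cases.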

The full enchilada would introduce download operations as workflow graph nodes. The state machine would be liable to issue a download Call whilst visiting a File Decl node. Probably overkill?
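To make the graph-node idea concrete, here is a toy sketch that is not based on miniwdl's real state machine. It assumes the workflow is given as a plain mapping from node name to the set of nodes it depends on, and `add_download_nodes` and `ready_batches` are made-up helper names. Scheduling is simplified to whole batches using the standard library's `graphlib` (Python 3.9+).

```python
from graphlib import TopologicalSorter
from typing import Dict, Iterable, List, Set


def add_download_nodes(
    deps: Dict[str, Set[str]],
    uri_consumers: Dict[str, Iterable[str]],
) -> Dict[str, Set[str]]:
    """Return a copy of the graph with one download node per URI input.

    Every node consuming the input gains a dependency on its download node.
    """
    graph = {node: set(d) for node, d in deps.items()}
    for name, consumers in uri_consumers.items():
        download = f"download:{name}"
        graph[download] = set()
        for consumer in consumers:
            graph.setdefault(consumer, set()).add(download)
    return graph


def ready_batches(graph: Dict[str, Set[str]]) -> List[List[str]]:
    """Group nodes into successive batches that could run concurrently."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    batches: List[List[str]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches
```

With the download modeled as an ordinary node, a step that does not consume the file lands in the same batch as the download instead of waiting behind it.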

mlin commented 4 years ago

Refinement of the above concept: if the file download cache is enabled, exclude from the upfront download pass any input URI consumed by a single node (outside a scatter) that precedes, in the workflow's dependency order, any other nodes consuming the File.
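One possible reading of this refinement, sketched with made-up names (`ancestors`, `first_consumer`) and a plain dependency mapping rather than miniwdl's own graph types: a URI can skip the upfront pass if some consumer outside a scatter runs before every other consumer, so that later consumers would find the file already in the download cache. The cache lookup itself is assumed and not modeled here.

```python
from typing import Dict, Iterable, List, Optional, Set


def ancestors(node: str, deps: Dict[str, Iterable[str]]) -> Set[str]:
    """All nodes that `node` transitively depends on."""
    seen: Set[str] = set()
    stack = list(deps.get(node, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(deps.get(current, ()))
    return seen


def first_consumer(
    consumers: List[str],
    deps: Dict[str, Iterable[str]],
    in_scatter: Set[str],
) -> Optional[str]:
    """Return a consumer that every other consumer depends on, if any.

    Candidates inside a scatter are skipped. None means the URI should
    stay in the upfront download pass.
    """
    for candidate in consumers:
        if candidate in in_scatter:
            continue
        if all(
            other == candidate or candidate in ancestors(other, deps)
            for other in consumers
        ):
            return candidate
    return None
```

If two consumers are unordered relative to each other, neither qualifies, since both could start downloading at the same time before the cache is populated.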

mlin commented 4 years ago

wip: https://github.com/chanzuckerberg/miniwdl/pull/417/commits/d3fd41b49f6b58ebb25ab3634767d4d5641823e3 but it's more complex than hoped for