broadinstitute / cromwell

Scientific workflow engine designed for simplicity & scalability. Trivially transition from one-off use cases to massive-scale production environments
http://cromwell.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

improve caching scheme to include removal of intermediate output/input objects #4064

Open mohawkTrail opened 6 years ago

mohawkTrail commented 6 years ago

I would like to enable call caching, but the size of the saved execution folders makes that prohibitively expensive. I am suggesting a scheme in which large intermediate objects (files) can be removed by Cromwell at our direction, yet still leave caching intact.

In this scheme it would be possible to mark a task output as "too big to keep, please remove when no longer referenced" (or something more pithy). The object would be left in the execution folder until after its last use and then removed (or removed at the very end of the workflow).

Caching would then need to be modified to "bracket" the first task that produces the object as an output and the last task that consumes it as an input, treating the tasks in between as a single group. The whole group would be skipped if the inputs to the first task and the outputs of the last task are all cache hits. (If the group is not linear but some other DAG, the scheme could be abandoned, or perhaps the method generalizes with more complicated rules; I'm not sure.)
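To make this concrete, here is a minimal WDL sketch of the idea. Everything specific in it is hypothetical: the `volatile_outputs` entry in `meta` is not existing WDL or Cromwell syntax, it only stands in for the proposed "too big to keep" marker, and the task and tool names (`MakeBig`, `Consume`, `make_big`, `summarize`) are placeholders.

```wdl
version 1.0

task MakeBig {
  input {
    File reads
  }
  command <<<
    make_big ~{reads} > big.dat        # placeholder command
  >>>
  output {
    File big = "big.dat"
  }
  meta {
    # Hypothetical "too big to keep" marker: Cromwell would be allowed to
    # delete this output from the execution folder after its last use.
    volatile_outputs: "big"
  }
}

task Consume {
  input {
    File big
  }
  command <<<
    summarize ~{big} > summary.txt     # placeholder command
  >>>
  output {
    File summary = "summary.txt"
  }
}

workflow Example {
  input {
    File reads
  }

  # big.dat is produced by MakeBig and consumed only by Consume, so these two
  # calls form the bracketed group. On a rerun the group would be skipped as a
  # unit if MakeBig's inputs and Consume's outputs are all cache hits, even
  # though big.dat itself has been deleted from the execution folder.
  call MakeBig { input: reads = reads }
  call Consume { input: big = MakeBig.big }

  output {
    File summary = Consume.summary
  }
}
```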

What follows is the motivation for this feature and a sequence of arguments for why it can work.

We have a certain number of input and output objects that we keep in their own buckets. The outputs are copied out of the execution folder after the workflow completes. These files we cache by reference, which works fine, and we want to keep the inputs and outputs forever. However, we have a large volume of intermediate files which end up in our cromwell-executions bucket.

We love caching. It works great. A fully cached workflow runs in about 5 minutes at next to no cost. Fresh workflows (no cache hits) cost on the order of $0.50 for typical examples, and run for a few hours.

Object storage has been eating us up, though. We've worked out that for a single one of these workflows the break-even point, at which it's cheaper to rerun it than to save and cache it, is about a week. If you take into account that we re-run workflows only a small fraction of the time, it probably doesn't pay to keep the execution folders at all (except for the intangible savings in wall-clock time).

[And Nearline/Coldline storage makes no sense at all: each cached file is accessed multiple times, and the retrieval fees make cached runs far more expensive than fresh runs.]

We’ve examined the pipeline, and we see that we could reduce the size of the intermediate outputs from 126 GB to 40 GB by combining separate tasks, which obviates the need to make the large file an output of the first task and an input to the second. This leads me to a question for the deep thinkers in Cromwell caching.

I want to ask if something makes sense in theory, for the purpose of making caching more feasible for us.

Suppose I took the two tasks I spoke of, one of which “passes” a large file to the second, and made them into a sub-workflow, and I marked the large files as “too big to keep” so that Cromwell would strip them out of the execution folder after the run completed. If caching were to work by looking at the inputs and outputs of the sub-workflow, rather than at each task one by one, then it would be possible to cache the entire sub-workflow. Right?
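As a sketch of that wrapping (again with made-up names, and assuming the two tasks from the earlier sketch live in a `tasks.wdl` file), the sub-workflow's interface would expose only the small files, so the large intermediate never appears among its inputs or outputs:

```wdl
version 1.0

import "tasks.wdl" as tasks   # assumed to contain the MakeBig and Consume tasks

workflow BigIntermediateGroup {
  input {
    File reads
  }

  call tasks.MakeBig { input: reads = reads }
  call tasks.Consume { input: big = MakeBig.big }

  output {
    # Only the small summary crosses the sub-workflow boundary; big.dat stays
    # internal, so (under the proposal) it could be deleted and the whole
    # sub-workflow cache-checked against just `reads` and `summary`.
    File summary = Consume.summary
  }
}
```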

Let’s say this sounds theoretically possible. Wouldn’t it then be possible to skip making an actual sub-workflow to bracket the discarded intermediates, and instead have clever Cromwell analyse the execution tree, notice the “too big to keep” vacancies, and apply the same optimization automatically?

mohawkTrail commented 6 years ago

Oh, this part is not essential to the basic suggestion, but since we copy our outputs somewhere else after the workflow finishes, it would also be nice not to keep them in the execution folder, but instead to refer to their new locations for cache-eligibility purposes.

Not essential, because even just getting rid of our multiple large intermediate files would make caching more feasible.