chanzuckerberg / miniwdl

Workflow Description Language developer tools & local runner

Ideas for the WDL runtime #166

Open rhpvorderman opened 5 years ago

rhpvorderman commented 5 years ago

Dear @mlin,

Thanks for your great work so far. I had been thinking about combining miniwdl and Snakemake this summer to see if I could make a WDL runtime. But now it turns out you have also started on a WDL runtime, so I would like to share my ideas so far; maybe they will help.

  1. Cromwell's design choice of using execution folders may be great for the cloud (I assume it is; I have no experience running Cromwell in the cloud), but it is a pain for any shared-filesystem backend, including local running. Lots of files are duplicated, and it creates headaches with reference-genome indexes, because those files need to sit in the same folder. Cromwell is constantly moving files (or hardlinks, or softlinks) around, which hurts speed and makes the Cromwell code base more complex. In contrast, Snakemake does not do this: it refers to inputs in place (a small sketch of the idea follows this list). This has some advantages:

     • No copying, moving, or linking files around
     • No headaches with indexes
     • No ballooning of an execution folder in terms of disk space
     • Able to delete temporary outputs
     • Simpler model, simpler code
     • Disadvantage: files can be overwritten if input and output paths are not set correctly. This is very easy to work around, of course.
  2. One of the things I was dreading was how to calculate the dependency graph, but miniwdl already solves this. Another thing I was dreading was writing all the plumbing needed to run jobs locally and on a cluster, but Snakemake solves that: it already has code to run any sort of job on any backend. So a lot of time can be saved by reusing this code instead of reinventing the wheel.

  3. Cromwell's choice to use a MySQL database for call caching is really unfortunate. It requires users to set up a MySQL database to use the feature, which is a bit of a pain. If you choose not to, Cromwell falls back to an in-memory database, and its memory footprint becomes huge on long-running workflows; that makes it a real pain to use on a cluster without MySQL. Snakemake determines progress through the graph by looking at which files exist. I don't think that is possible with WDL as a language, since scatters and conditionals make the expected outputs variable. If a database backend is implemented, it should be SQLite by default (see the sketch after this list).
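
To make point 1 concrete, here is a minimal sketch (not miniwdl's or Snakemake's actual code; the function name and layout are made up) of the difference between referring to an input in place and copying it into a per-call execution directory:

```python
import os
import shutil

def localize_input(path: str, exec_dir: str, in_place: bool = True) -> str:
    """Return the path a task should use for one of its File inputs."""
    if in_place:
        # Refer to the file where it already lives: nothing is copied or
        # linked, and sibling files (e.g. reference-genome indexes) stay
        # right next to it.
        return os.path.abspath(path)
    # Cromwell-style: duplicate (or link) the file under the call's own
    # execution directory, which costs time and disk space.
    dest = os.path.join(exec_dir, "inputs", os.path.basename(path))
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    shutil.copy2(path, dest)
    return dest
```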

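And for point 3, a rough sketch of what an SQLite-backed call cache could look like, keyed on a digest of the task name plus its inputs. This only illustrates the idea with Python's stdlib sqlite3; it is not a proposed implementation:

```python
import hashlib
import json
import sqlite3

db = sqlite3.connect("call_cache.db")
db.execute("CREATE TABLE IF NOT EXISTS calls (key TEXT PRIMARY KEY, outputs TEXT)")

def _key(task_name: str, inputs: dict) -> str:
    # Stable digest of the call: task name plus JSON-serialized inputs.
    blob = json.dumps({"task": task_name, "inputs": inputs}, sort_keys=True)
    return hashlib.sha256(blob.encode()).hexdigest()

def lookup(task_name: str, inputs: dict):
    # Return cached outputs for an identical previous call, or None.
    row = db.execute("SELECT outputs FROM calls WHERE key = ?",
                     (_key(task_name, inputs),)).fetchone()
    return json.loads(row[0]) if row else None

def record(task_name: str, inputs: dict, outputs: dict) -> None:
    db.execute("INSERT OR REPLACE INTO calls VALUES (?, ?)",
               (_key(task_name, inputs), json.dumps(outputs)))
    db.commit()
```
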
I was planning to spend a week during the summer seeing how far I could get with these ideas, but since you are already planning a WDL runtime, I think that time is better spent trying to improve the WDL runtime in miniwdl. Feel free to kick some tasks my way!

Best regards, Ruben Vorderman

mlin commented 5 years ago

@rhpvorderman Thank you for this! Our initial goal for the runner is to create something pretty simple for execution on the local host that, importantly, (i) supports the OpenWDL community with an accessible codebase for hacking/experimenting and (ii) has a very modular construction, so that parts of it can be reused heavily for WDL support in many backends/platforms. So, if we do this right, I think your summer project to hook it into Snakemake would remain eminently feasible and valuable :smile: I'd compare the initial scope to cwltool, which plays such a role for the CWL ecosystem.

I've certainly also experienced the challenges with Cromwell copying and linking large files. For several years, reliance on a shared filesystem (NFS etc.) was kind of an anti-pattern for the major public clouds, so it's understandable that Cromwell is oriented to copying files around (because you'd have to be transferring them to and from object storage anyway). Nowadays though the clouds all have easy and performant shared filesystem services, which have really only matured in the past couple of years. It will be interesting to explore how these might simplify the file localization issues that cloud workflow runners have otherwise been designed around. Not something we'll get to right away here, but it's in the back of my mind during the build. (For example, the incipient task runtime uses some docker features to make input files read-only to the task, so that it can't clobber them.)
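
For readers curious what that looks like in practice, here is a minimal sketch using the Docker SDK for Python; the image, paths, and function are hypothetical, and miniwdl's actual task runtime may do this differently:

```python
import os
import docker  # Docker SDK for Python

def run_with_readonly_input(image: str, command: str, host_path: str) -> bytes:
    """Run `command` in `image` with one input file bind-mounted read-only."""
    client = docker.from_env()
    mount_point = "/mnt/inputs/" + os.path.basename(host_path)
    # mode="ro" lets the task see the file in place but not clobber it.
    return client.containers.run(
        image,
        command,
        volumes={os.path.abspath(host_path): {"bind": mount_point, "mode": "ro"}},
        remove=True,
    )
```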

So far I haven't had a chance to work out the model for serializing the AST and workflow state, which bears on distributed execution and call caching as you say. Python pickling is really convenient, but opaque.
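
For what it's worth, a toy contrast of the two approaches (the state shown here is invented, not miniwdl's):

```python
import json
import pickle

# Some hypothetical per-call state a runner might want to persist.
state = {"call": "align.bwa_mem", "inputs": {"reads": "/data/sample1.fq.gz"}, "status": "running"}

opaque = pickle.dumps(state)            # one line, but a binary blob tied to Python internals
readable = json.dumps(state, indent=2)  # inspectable and diffable, but needs explicit (de)serialization for richer objects
```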

Again, I can't overstate how excited we are that miniwdl is getting to a functional enough state for community contributors to start taking an interest! The project kanban has all of our tactical items, and I'm trying to groom some of them as Starter tasks. Another sizable but parallelizable effort underway is implementing standard library functions, which has existing function implementations to template from and is probably a good way to get acclimated to the internal architecture and test suite.