boyleworkflow / boyle

A tool for provenance and caching in computational workflows.
GNU Lesser General Public License v3.0

Build Boyle language as internal DSL? #1

Open · rasmuse opened 8 years ago

rasmuse commented 8 years ago

I really like the split-apply-collect / map-reduce structure we discussed last time. This is definitely a step in the right direction.

However, I find the workflow file example we built to be rather hard to follow. An important problem, in my view, is that the definitions "inside" the split, i.e. between the split and the collect, look exactly like the other definitions at a quick glance.

So I spent a couple of hours trying to come up with a succinct way of adding another indentation/nesting level for the nested operations, to make them stand out a bit more visually.

This went so badly that I instead started scribbling down the same thoughts in the form of an internal DSL in Python, and that just felt much easier to get right. This has both pros and cons, of course... but I'm increasingly convinced that we should at least build the language on something with a bit more expressiveness than JSON/YAML. A couple of interesting examples of how this can be done in Python are doit and Luigi. Neither is perfect, but all the examples I've found are pretty legible and pretty easy to understand.
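
For a taste of that style, here is roughly what the cat example below would look like as a doit task. This is only a quick sketch based on doit's documented conventions (a dodo.py file with task_* functions returning a dict of actions, file dependencies and targets); the paths are illustrative:

# dodo.py -- doit picks up functions named task_*
def task_concatenate():
    return {
        'file_dep': ['indata/file1.txt', 'indata/file2.txt'],
        'targets': ['output/concatenated.txt'],
        'actions': ['cat indata/file1.txt indata/file2.txt > output/concatenated.txt'],
    }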

Anyway, please see examples here! The best one in my view is variant 3.

define function

The basic building block I have played with is a define function, working something like this:

w = Workflow()

define(w.some_input_file ~ LocalFile('indata/file1.txt'))
define(w.another_input ~ LocalFile('indata/file2.txt'))

define(
    w.concatenated ~ LocalFile('path/to/expected/output.txt'),
    [w.some_input_file, w.another_input],
    Shell('cat {inp[0].path} {inp[1].path} > {out.path}'))

Here, the definitions of the two input files essentially say "we should expect to find two files on the local drive named 'indata/file1.txt' and 'indata/file2.txt' without doing anything"; in other words, they are root nodes.

The last definition essentially says "w.concatenated is the name we give to a file that will be located at 'path/to/expected/output.txt' if the shell script cat ... is run with those two input files in place".
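
One caveat with the spelling above: a binary ~ is not valid Python (~ only exists as a unary operator), so w.x ~ LocalFile(...) is sketch notation rather than something that would parse. To convince myself the idea is implementable, here is a minimal, hypothetical sketch that spells the binding as << instead; the Target class and the << choice are my own placeholders, not a committed design:

class LocalFile:
    def __init__(self, path):
        self.path = path

class Shell:
    def __init__(self, template):
        self.template = template

class Target:
    # A named output slot on a Workflow; '<<' binds a resource to it.
    def __init__(self, name):
        self.name = name
        self.resource = None

    def __lshift__(self, resource):
        self.resource = resource
        return self

class Workflow:
    def __init__(self):
        self._targets = {}

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails, so '_targets'
        # itself never recurses into here; targets are created lazily by name.
        return self._targets.setdefault(name, Target(name))

definitions = []

def define(target, inputs=(), task=None):
    # Record the target, its inputs and the task that produces it;
    # inputs and task are omitted for root nodes.
    definitions.append((target, tuple(inputs), task))

With that in place, the example above runs as written, modulo ~ becoming <<:

w = Workflow()
define(w.some_input_file << LocalFile('indata/file1.txt'))
define(w.another_input << LocalFile('indata/file2.txt'))
define(
    w.concatenated << LocalFile('path/to/expected/output.txt'),
    [w.some_input_file, w.another_input],
    Shell('cat {inp[0].path} {inp[1].path} > {out.path}'))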

map and reduce

My idea is to use Python's with statement, via context managers, to give the split-apply operations a more visible formulation. We could probably come up with a better context manager than my Each() sketch, but the example should convey the gist of it.

with Each(w.a_splittable_output) as item_name:
    # define things using item_name here
    define(w.something ~ Value(), item_name, Python(...))

# and then collect/merge/reduce the stuff based on item_name here
list_of_somethings = collect(w.something)
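
As a sanity check that the with-based formulation is expressible, here is a minimal, hypothetical sketch of Each; the bookkeeping (a module-level "current split" and a flat definitions list) is just mine, and collect is left as a stub:

from contextlib import contextmanager

definitions = []      # (split, args) pairs; split is None outside Each blocks
_current_split = None

@contextmanager
def Each(splittable):
    # Entering the with-block makes 'splittable' the current split, so any
    # define() call inside is recorded as a per-item definition. Saving and
    # restoring the previous value keeps nested Each blocks well-behaved.
    global _current_split
    previous, _current_split = _current_split, splittable
    try:
        yield splittable  # the yielded value stands in for "one item"
    finally:
        _current_split = previous

def define(*args):
    definitions.append((_current_split, args))

def collect(target):
    # Stub: would create a node that gathers the per-item results of
    # 'target' into a list once the split has been executed.
    return ('collect', target)

The main point is simply that the with statement buys us the extra indentation level for free, which was exactly what I couldn't get right in the JSON/YAML formulation.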

Of course, it's then possible to extend the syntax to allow the named "dimensions" we talked about, to add some form of keys to the mapped items that can be used in more complicated reductions than "collect", and so on.