everpub / openscienceprize

:telescope: Everpub - Making reusability a first class citizen in the scientific workflow.

demo: notebook with attached cluster + declarative workflows #116

Open lukasheinrich opened 8 years ago

lukasheinrich commented 8 years ago

Hi all,

this touches on #16 #51 and other issues

as promised, here is a small demo of how one could use declaratively defined workflows together with a Docker Swarm cluster to run workflows whose steps are each captured in different Docker containers. This is the notebook:

https://github.com/lukasheinrich/yadage-binder/blob/master/example_three.ipynb

in the GIF, each of the yellow bubbles executes in its own container, in parallel where possible. All these containers, as well as the container that the notebook runs in, share a Docker volume mounted at /workdir, so that they at least share filesystem state. This keeps the execution itself isolated, but allows steps to read the outputs of previous steps and take them as inputs.
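To make the shared-state idea concrete, here is a minimal sketch (not taken from the demo; the volume name and the busybox image are placeholders) of two steps running in separate containers that communicate only through a named volume mounted at /workdir:

```python
import subprocess

# Placeholder volume/image names: one named volume is mounted at /workdir
# in every step container, so later steps can read earlier outputs.
subprocess.check_call(["docker", "volume", "create", "workdir"])

# step 1: runs in its own container and writes into the shared volume
subprocess.check_call([
    "docker", "run", "--rm", "-v", "workdir:/workdir", "busybox",
    "sh", "-c", "echo 'step one output' > /workdir/step1.txt",
])

# step 2: a different container, but it sees step 1's output as its input
subprocess.check_call([
    "docker", "run", "--rm", "-v", "workdir:/workdir", "busybox",
    "cat", "/workdir/step1.txt",
])
```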

let me explain the different parts:

this is a small workflow tool I wrote to execute arbitrary DAGs of Python callables in cases where the full DAG is not known upfront but only develops over time. It keeps track of a graph and has a set of rules for when and how to extend that graph.
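As a rough illustration of the idea behind that tool (a toy, not the actual tool's API): a graph of callables plus rules that extend the graph once earlier results are known, so the DAG grows while the workflow runs.

```python
# Toy sketch only: node names, the fan-out rule, and the example
# callables are made up for illustration.

class DynamicDAG:
    def __init__(self):
        self.nodes = {}      # name -> (callable, [dependency names])
        self.results = {}    # name -> result, once executed
        self.rules = []      # callables that may extend the graph

    def add_node(self, name, func, deps=()):
        self.nodes[name] = (func, list(deps))

    def add_rule(self, rule):
        self.rules.append(rule)

    def run(self):
        while True:
            # let applicable rules extend the graph based on known results
            for rule in list(self.rules):
                if rule(self):
                    self.rules.remove(rule)
            ready = [n for n, (func, deps) in self.nodes.items()
                     if n not in self.results
                     and all(d in self.results for d in deps)]
            if not ready:
                break
            for name in ready:
                func, deps = self.nodes[name]
                self.results[name] = func(*[self.results[d] for d in deps])


dag = DynamicDAG()
dag.add_node("generate", lambda: list(range(4)))

# rule: only once "generate" has run do we know how many nodes to add
def fan_out(dag):
    if "generate" not in dag.results:
        return False
    for i, x in enumerate(dag.results["generate"]):
        dag.add_node("square_%d" % i, lambda x=x: x * x)
    return True

dag.add_rule(fan_out)
dag.run()
print(dag.results)
```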

this next piece is the same concept but adds a declarative layer. In effect it defines a callable based on a JSON/YAML file like this one:

https://github.com/lukasheinrich/yadage-workflows/blob/master/lhcb_talk/dataacquisition.yml

that defines a process with a couple of parameters, complete with its environment and a procedure for determining the result.

this is already helpful for using Docker containers as black-box Python callables, like here:

https://github.com/lukasheinrich/yadage-binder/blob/master/example_two.ipynb
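A hedged sketch of what "container as a black-box Python callable" can look like. The spec dict below is only a stand-in for a declarative file like the one above; the image name, command template, and parameter names are made up, and this is not the actual API of the tools used in the demo.

```python
import subprocess

# Stand-in for a declarative spec: parameters, an environment
# (Docker image), and a procedure (parametrised command).
spec = {
    "image": "busybox",
    "parameters": ["message", "outputfile"],
    "command": "echo {message} > /workdir/{outputfile}",
}

def callable_from_spec(spec):
    """Turn a declarative process description into a plain Python callable."""
    def process(**pars):
        assert set(pars) == set(spec["parameters"])
        cmd = spec["command"].format(**pars)
        # run the procedure inside the declared environment, sharing /workdir
        subprocess.check_call([
            "docker", "run", "--rm", "-v", "workdir:/workdir",
            spec["image"], "sh", "-c", cmd,
        ])
        return "/workdir/" + pars["outputfile"]
    return process

acquire_data = callable_from_spec(spec)
result_path = acquire_data(message="hello", outputfile="data.txt")
print(result_path)
```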

On top of these callables, there is also a way to define complete workflows in a declarative manner like here:

https://github.com/lukasheinrich/yadage-workflows/blob/master/lhcb_talk/simple_mapreduce.yml

https://github.com/lukasheinrich/yadage-binder/blob/master/example_four.ipynb (try changing the number of input datasets, but don't forget to clean up the workdir using the cell above)

which can then be executed by the notebook. As a result we get the full resulting DAG (complete with execution times) as well as a PROV-like graph of "entities" and "activities".
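To illustrate the "entities and activities" part, here is a hedged sketch (using networkx, with made-up step and file names; this is not the demo's actual data model) of a PROV-like view in which activities are linked to the entities they used and generated:

```python
import networkx as nx

# Made-up steps and file names, roughly following the map/reduce shape
# of the example workflow above.
steps = [
    # (activity, entities it used, entities it generated)
    ("acquire", [], ["/workdir/input.txt"]),
    ("map-0",   ["/workdir/input.txt"], ["/workdir/out0.txt"]),
    ("map-1",   ["/workdir/input.txt"], ["/workdir/out1.txt"]),
    ("reduce",  ["/workdir/out0.txt", "/workdir/out1.txt"],
                ["/workdir/merged.txt"]),
]

prov = nx.DiGraph()
for activity, used, generated in steps:
    prov.add_node(activity, kind="activity")
    for ent in used:
        prov.add_node(ent, kind="entity")
        prov.add_edge(activity, ent, relation="used")
    for ent in generated:
        prov.add_node(ent, kind="entity")
        prov.add_edge(ent, activity, relation="wasGeneratedBy")

for src, dst, data in prov.edges(data=True):
    print("%s --%s--> %s" % (src, data["relation"], dst))
```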

this is just a small wrapper on top of yadage that installs the IPython notebook. It doesn't really work on Binder as originally intended, since I can't get Binder to provide writable VOLUMEs, so currently you have to start it on Carina instead, like so:

```
docker run -v /workdir -p 80:8888 -e YADAGE_WITHIN_DOCKER=true -e CARINA_USERNAME=$CARINA_USERNAME -e CARINA_APIKEY=$CARINA_APIKEY -e YADAGE_CLUSTER=yadage lukasheinrich/yadage-binder
```

where you pass in your Carina credentials and the cluster name.

betatim commented 8 years ago

This is nice!

For me a key takeaway (beyond the fact that you are building nice new tools) is that you have a notebook from which the user drives everything. In this case the "stuff" being driven is fairly complicated and does a lot of work, and yet you can use the notebook to explain to someone, in words, what you just did. To me this shows:

lukasheinrich commented 8 years ago

Hi Tim,

yes, exactly. Making Python callables from Docker containers allows us to abstract away a lot of the complicated stuff. In fact, this workflow here:

https://github.com/lukasheinrich/yadage-binder/blob/master/example_one.ipynb

calls Monte Carlo generators to calculate QFT matrix elements, runs the parton showering, runs a pseudo detector simulation using ROOT, etc. All of this is packaged up nicely in its own (possibly third-party-provided) Docker containers and does not need to live in the container that houses the notebook.

For the more light-weight stuff you can still install all needed dependencies in your notebook container and go from there (for example, pick up the HepMC file generated by a Monte Carlo Docker container and analyze it further from the notebook). I.e. you can freely mix how much complexity you want to keep in your notebook container versus outsourcing jobs to other containers.
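As a toy example of that mixing (the file name is a placeholder, and this assumes HepMC2 ASCII output, where each event record starts with a line beginning with "E"), the notebook side can stay lightweight and just inspect what the generator container left in /workdir:

```python
# The generator ran in its own container; the notebook only reads the
# shared /workdir.  File name is made up for illustration.
n_events = 0
with open("/workdir/events.hepmc") as hepmc_file:
    for line in hepmc_file:
        if line.startswith("E "):   # HepMC2 ASCII: one "E" line per event
            n_events += 1
print("events written by the generator container:", n_events)
```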

Also, I like that it's very expandable. If you define the workflow in a declarative manner, such that you have access to the DAG, you can distribute the computation across the Docker Swarm (provided the swarm supports networked volume drivers).
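A hedged sketch of that point (the swarm endpoint, image, and volume names are made up, and it assumes the shared volume is backed by a networked volume driver so every node sees the same /workdir): with DOCKER_HOST pointed at the swarm manager, independent DAG steps can simply be launched in parallel and the swarm schedules each container onto some node.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Point the Docker CLI at the swarm manager; the endpoint is a placeholder.
env = dict(os.environ, DOCKER_HOST="tcp://swarm-manager:2376")

def run_step(i):
    # each independent step becomes one container, scheduled by the swarm
    subprocess.check_call([
        "docker", "run", "--rm", "-v", "workdir:/workdir", "busybox",
        "sh", "-c", "echo 'step %d done' > /workdir/step%d.txt" % (i, i),
    ], env=env)

with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(run_step, range(4)))
```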

lukasheinrich commented 8 years ago

so it seems like there are three levels of code that can be composed at will:

1) You start in a notebook, dabbling around with your code.
2) If you realize you need this code more often, you are very likely to put it into its own packages/modules and only call those libraries from the notebook.
3) If you realize that your notebook image becomes very heavy, or too complicated to satisfy all the dependencies of the different environments you need on a single machine, you can put some code into its own Docker container so that it has its own set of dependencies and doesn't need to play nice with everything else. You can then call those dockerized codes via a Python (R?) API as if they were native callables.

I think this way of thinking will give a lot of flexibility with everpub.

khinsen commented 8 years ago

@lukasheinrich That looks good from the user's perspective, assuming it works without bad surprises in practice. Somehow the idea of Russian-doll Docker containers is weird... And I wonder if going back to statically linked executables wouldn't be simpler in the end!

lukasheinrich commented 8 years ago

@khinsen Statically linked executables certainly are a nice solution to the dependency problem. I think they fall squarely into category 2) of the code levels outlined above, i.e. statically linked executables should be easy to integrate on the main host. So if you have access to them, or are able to compile them yourself, that's great.

For HEP at least, the use of dynamic linking is very widespread, often with runtime loading of libraries depending on e.g. the inputs.