Workflow - Scientist - Githubissues

betatim commented 8 years ago

Outline the envisioned workflow for a scientist. With this we can build a better idea of what needs teaching, blue-printing, etc

First suggestion for a workflow:

start new data analysis by creating a empty directory
type openscience init to create a skeleton
- runs git init, creates a "sensible" Dockerfile
- setups up aliases for running things in the docker container?
create code, run it with openscience run <cmd> which executes it inside the docker container
create a notebook or .md with code blocks that has narrative mixed with steps for reproducing parts of the analysis
git commit all along
push repo to git hub at some point(?)
as analysis comes to an end create a new ipynb/md that is the paper, preview it with openscience paper(?)

(I will edit this entry as we iterate)

betatim commented 8 years ago

A minimal git repository that would work as "executable paper" for the "Great icecream preference study of 2016":

icecream-prefs/
|-> icecream/
|   \-> ... library code ...
|-> data/  # include directly for small data or mount point for data-volumes
|-> Dockerfile  # how to setup the environment
|-> paper.{ipynb,md}  # the executable paper
|-> travis.yml  # CI instructions

The paper.{ipynb,md} drives the analysis, all the heavy lifting is done somewhere inside icecream/.

We can provide a the openscience command-line tool that create this layout, and uses it to allow you to run stuff locally using the docker container described in Dockerfile

betatim commented 8 years ago

ping @ctb (who should start watching this repo for all the notifications all the time)

betatim commented 8 years ago

What you would see as the publication is the rendered version of paper.{md,ipynb} with the ability to edit it and re-run. paper.md is like a README for producing the conclusions of the paper but without having to copy and paste stuff. So it might well contain code chunks that say

Figure 4 shows that clear chocolate is the best flavour. To run the extended analysis on chocolate we run:
 `` `
 make step23
 `` `
Which produces the following table:
...

betatim commented 8 years ago

The entry point would always be to-be-invented-execute.sh paper.* This should do all the things that need doing if placed inside the docker container built according to the Docekrfile in the repo. I could be persuaded that we need a .analysis.yaml but right now I am not sure, given you can do what ever you want in the Dockerfile.

We should provide a base docker image that contains to-be-invented.sh and other useful things like the jupyter kernel+plumbing needed to run a Rmarkdown, pythonmarkdown, ipynb.

travis.yml could be as simple as instructing travis to build our container from the Dockerfile and then execute to-be-invented.sh paper.* as well as a step to upload the rendered version. We should provide a template for this travis.yml so people can set it up.

ctb commented 8 years ago

Yes, good stuff!

My concern is that if this is the only allowed structure and workflow (as opposed to merely a strongly recommended one) we will automatically lose most potential early adopters - essentially, anyone who is already doing their own thing in this area. With the specfile idea we could allow a much broader range of repo structure/workflow (and provide a Web site to build the spec by inspecting a repo), while using the above as a specific structure & workflow for demo purposes.

ctb commented 8 years ago

A strong -1 on it being a shell script - something declarative offers many more opportunities for simplicity and introspection and composition. If procedural (like a Dockerfile or a shell script) then we need to run it to find out what it does. With a YAML spec, it could specify what resources need to be present, along with inputs and outputs, and then everything (travis.yml) could be produced from that, no?

betatim commented 8 years ago

If you are doing your own thing, and don't want to make a paper.md that we can render, what do we show as the "executable paper"?

We could also inject the required stuff via docker-compose. This might remove the need for a shared base image.

ctb commented 8 years ago

On Tue, Feb 23, 2016 at 06:30:28AM -0800, Tim Head wrote:

If you are doing your own thing, and don't want to make a paper.md that we can render, what do we show as the "executable paper"?

that can be part of the spec, no? We can require it be md or md-convertible, of course, but I still write my papers in LaTeX (for example).

We could also inject the required stuff via docker-compose. This might remove the need for a shared base image.

Yep!

betatim commented 8 years ago

Specifying resources is a pro for having .analysis.yml. You enter the world of pain of how do I specify a requirement like "CERN batch system circa Feb 2016 when the Os they ran was SLC6.blah" or "the batch system we have in University of Somewhere circa Feb2016". I think that is corner cases though or at least we should delay that for a while. Focus on things like "100GB RAM", "a GPU", etc.

Con for docker-compose, we need to know which version of the required-stuff image to inject. Could be noted in .analysis.yml.

I am open for supporting more formats for people to write their paper in. I would insist though that the format they use has a way of mixing code with prose (like a notebook). For markdown and ipynb I know how to do that. Do you know (sane) ways of doing this in LaTeX?

After reflecting on this over :coffee: I am :+1: on a .analysis.yml that specifies deps, docker image to inject for required stuff, and the command to generate the "rendered HTML with interactivity" paper.

ctb commented 8 years ago

On Tue, Feb 23, 2016 at 06:42:21AM -0800, Tim Head wrote:

Specifying resources is a pro for having .analysis.yml. You enter the world of pain of how do I specify a requirement like "CERN batch system circa Feb 2016 when the Os they ran was SLC6.blah" or "the batch system we have in University of Somewhere circa Feb2016". I think that is corner cases though or at least we should delay that for a while. Focus on things like "100GB RAM", "a GPU", etc.

Agreed on world of pain! And agree we should allow arbitrary config (perhaps via Dockerfile?) for corner/edge cases, but should encourage standardization.

Con for docker-compose, we need to know which version of the required-stuff image to inject. Could be noted in .analysis.yml.

I am open for supporting more formats for people to write their paper in. I would insist though that the format they use has a way of mixing code with prose (like a notebook). For markdown and ipynb I know how to do that. Do you know (sane) ways of doing this in LaTeX?

Good point -- and @camillescott has at least been playing with some tools.
I am +1 on that requirement, we can figure out latex later!

After reflecting on this over :coffee: I am :+1: on a .analysis.yml that specifies deps, docker image to inject for required stuff, and the command to generate the "rendered HTML with interactivity" paper.

coo'.

betatim commented 8 years ago

(note to future: in the above comment there is a sentence from titus hidden in what looks quoted text about figuring out latex later)

khinsen commented 8 years ago

A useful notion in Guix is the "build system", which is a package of tools and conventions to manage a build process. Guix has a build system based on autoconf/automake, one based on Python's distutils, etc. Considering that "building" means nothing else than "producing a digital artefact", this can easily be extended to computational science. Running a data analysis is the same as building a data analysis report.

Given the current state of the art (which is a mess), I think the best approach would be to allow arbitrary build systems, the condition being that they produce rendered output according to some criteria. Users would be strongly encouraged to use an existing build system rather than make their own, so in the end we'd have a few but not many.

Another aspect of Guix build systems worth copying is that the input to a build system is declarative and therefore analyzable.

tritemio commented 8 years ago

As food for thoughts. Let's keep simple things simple and hard things possible.

Many workflows only require python+cython or R. These cases should not be more complex because of the requirements of other workflows which requires building custom code etc...

betatim commented 8 years ago

Agree with you Antonino.

Being able to bring your own docker container to a HPC system, batch queue or use it on the LHC computing grid isn't possible right now. However movement towards that has started at CERN. HTCondor supports jobs-in-containers apparently.

re: custom software Take a look at https://github.com/betatim/everware-demo which builds on the image from https://github.com/betatim/everware-cern-analysis/blob/master/Dockerfile The point being that even for a demo you need quite some custom c++ software (which you drive from python) but I think it is well addressed with the approach we propose.

On Wed, Feb 24, 2016 at 4:55 PM Antonino Ingargiola < notifications@github.com> wrote:

As food for thoughts. Let's keep simple things simple and hard things possible.

Many workflows only require python+cython or R. These cases should not be more complex because of the requirements of other workflows which requires building custom code etc...

— Reply to this email directly or view it on GitHub https://github.com/betatim/openscienceprize/issues/16#issuecomment-188316668 .

cranmer commented 8 years ago

Recast project is very workflow oriented. https://github.com/recast-hep/

Here's a recent talk focusing on docker, and "parametrized workflows" for the LHC context. Workflow stuff starts around slide 10. https://indico.cern.ch/event/501469/

@lukasheinrich can do a better job of describing this, but here's a try:

We are preparing a document that describes high-level design for executing "parametrized workflows". We have iterated on a JSON schema to describe quite generic "parametrized workflows" or "workflow templates". We allow for each step of the workflow template to be in a different environment (in practice, we are mainly using docker).

In the current design there are schedulers that parse the workflow template and the various parameters that are needed to start executing steps in the DAG. @michal-szostakl is working on making this talk to various types of clusters (AWS, carina, google container project, CERN container project, etc.). This produces what we call a "workflow instance" (eg. the specific jobs that ran, their outputs, etc.) and that can be described with something like PROV.

(boxes are "activities" and circles are "Entities" in the PROV language)

lukasheinrich commented 8 years ago

Hi all,

I think the model we came up with can be quite general and in my initial tests it was easy to describe even somewhat complex workflow graphs.

The reason we separated the "workflow template" from the "workflow instance" is that this maps better to how usually we think about these workflows. I.e. in our heads we think of a workflow stage as "process all these files from the previous stage in parallel" instead of thinking in terms of very concrete filenames / paths. Also sometimes the full graph is not known ahead of time (which is why we couldn't use snakemake / pydoit / and friends)

For the actual workflow instance that @cranmer posted above, this is the graph of the workflow template

stages

I intentionally modelled it such that it could be written down somewhat succinctly in a travis-like manner and can be executed locally (as @cranmer mentioned we're working on remote execution as well)

lukasheinrich commented 8 years ago

Also i agree with @tritemio. If stuff is really simple, it should stay simple and not be overly complex. If e.g. you can package all your requirements in a single e.g. docker image and run the workflow with

docker run <myimage> ./runworkflow arg1 arg2

you shouldn't need to specify a whole lot more.

lukasheinrich commented 8 years ago

this would be the simplest example of a single step process that is parametrizes by an input and output argument. As @ctb said, this more declarative way of specifying the workflow allows for many downstream applications. you can query e.g. what code is used (i.e. what docker images), what the interdependencies of various workflow steps are, what the parameters are etc. (that's what makes it easy for us to visualize)

context:
  inputparameter: ~
  outputparameter: ~
stages:
  - name: dummystage
    parameters:
      input: '{inputparameter}'
      output: '{outputparameter}'
    scheduler:
      scheduler-type: 'singlestep-from-context'
      steps:
        single:
          process:
            process-type: 'string-interpolated-cmd'
            cmd: 'echo {input} {output}'
          publisher:
            publisher-type: 'process-attr-pub'
            outputmap:
              step_output: output
          environment:
            environment-type: 'docker-encapsulated'
            image: busybox

workflow template: adage_stages

workflow instance: adage_workflow_instance

ctb commented 8 years ago

I like all of the comments here!

What about including links to these issues in the proposal? I don't think we want to say we've reached any conclusions yet, and the proposal is due tomorrow, but I think these discussions are incredibly valuable and we can point to them as initial progress.

betatim commented 8 years ago

That is a good idea! :thumbsup:

On Sun, Feb 28, 2016 at 4:16 PM C. Titus Brown notifications@github.com wrote:

I like all of the comments here!

What about including links to these issues in the proposal? I don't think we want to say we've reached any conclusions yet, and the proposal is due tomorrow, but I think these discussions are incredibly valuable and we can point to them as initial progress.

— Reply to this email directly or view it on GitHub https://github.com/betatim/openscienceprize/issues/16#issuecomment-189890200 .

cranmer commented 8 years ago

See also #50 . Note there are two notions of "workflow" being discussed. One is how a user of everpub uses the tools. The second is the workflow coded up in the code itself, and more connected to composition etc.

lukasheinrich commented 8 years ago

Hi,

I recently stumbled on http://common-workflow-language.github.io/ and it seems like another workflow specification language, apparently used primarily bio/med fields.

Does anyone here have experience with this / know anything about it?

Cheers, Lukas

ctb commented 8 years ago

Maybe...

https://www.genomeweb.com/informatics/seven-bridges-funds-uc-davis-support-development-standardized-workflow-language

:)

It's kind of a meta specification, and while it's something we should support I didn't want to bake it into the proposal.

lukasheinrich commented 8 years ago

ugh.. behind a paywall even from NYU network. is there free info on this somewhere? Obviously there is an interest in this across fields in having something like this, which is good.

cranmer commented 8 years ago

I’m not premium, so I can’t see that article :-)

On Feb 29, 2016, at 1:44 PM, C. Titus Brown notifications@github.com wrote:

Maybe...

https://www.genomeweb.com/informatics/seven-bridges-funds-uc-davis-support-development-standardized-workflow-language

:)

It's kind of a meta specification, and while it's something we should support I didn't want to bake it into the proposal. — Reply to this email directly or view it on GitHub https://github.com/betatim/openscienceprize/issues/16#issuecomment-190327389.

ctb commented 8 years ago

On Mon, Feb 29, 2016 at 10:47:59AM -0800, Lukas wrote:

ugh.. behind a paywall even from NYU network. is there free info on this somewhere? Obviously there is an interest in this across fields in having something like this, which is good.

I can send you a PDF but that doesn't help in general, does it? Anyway, yes, I currently employ the CWL community manager, @mr-c ;).

lukasheinrich commented 8 years ago

That's great. So, I skimmed over that and it seems somewhat similar to our workflow spec. Maybe there is an opportunity there to converge. The one feature, I think that I haven't seen elsewhere, is flexibility in the workflow DAG itself. Our spec allows for extending the graph in certain ways while it is running, which is helpful in cases where the graph structure depends on the outcomes of previous nodes in the graph.

ctb commented 8 years ago

On Mon, Feb 29, 2016 at 10:55:20AM -0800, Lukas wrote:

That's great. So, I skimmed over that and it seems somewhat similar to our workflow spec. Maybe there is an opportunity there to converge. The one feature, I think that I haven't seen elsewhere, is flexibility in the workflow DAG itself. Our spec allows for extending the graph in certain ways while it is running, which is helpful in cases where the graph structure depends on the outcomes of previous nodes in the graph.

Sounds like a nice convergence! Drop me an e-mail titus@idyll.org if you want an e-introduction to @mr-c.

mr-c commented 8 years ago

Hello, @mr-c here. I'm the Community Engineer for the #CommonWL. I'm coming up to speed on what you all are doing and I see a lot of crossover.

One of my main personal motivations for CWL was that there should be a way to run the complete analysis graph from a paper AND re-mix/re-use it with your own data. Hopefully our 3rd draft of the spec provides much of the functionality needed by that.

I'm not sure why @ctb thinks of us as a meta-specification; CWL tool descriptions and workflows made from those tool descriptions are completely runnable on a local machine, in a docker container, or on an academic cluster/grid.

We have a chat room if you'd like some real time conversation at https://gitter.im/common-workflow-language/common-workflow-language

FYI @betatim we have Docker containers running on HPC systems without root: https://github.com/common-workflow-language/common-workflow-language/wiki/Userspace-Container-Review#getting-userspace-containers-working-on-ancient-rhel

lukasheinrich commented 8 years ago

Hi @mr-c,

so re-mixing is exactly a point where the flexibility in the graph itself becomes important. Think of this simple type of map-reduce workflow:

1) one step that take a couple of parameters and produces N files by running code in docker container A 2) process each of those N files in parallel using code in docker container B to produce N new files 3) merge those N result files into a single result using some code in container C

now, different input parameters might result in different number of produced files, so the actual graph becomes invocation dependent (though similar structurally between invocations). We solved this in our proposal by allowing for "schedulers" which take a graph -- as executed until this point -- and its invocation parameters to extend the graph with the nodes based on the results up to that point.

One approach of course is to hide the parallel computation in a single step that handles all of it, but that goes a bit against the re-usability ideal, since the core component one want to re-use is what happens in each node.

Have you encountered these things within the CWL development?

mr-c commented 8 years ago

Hello @lukasheinrich

Yep, this topic comes up frequently.

We currently support the scenario you outlined with our scatter/gather feature: http://common-workflow-language.github.io/draft-3/Workflow.html#WorkflowStep

I'm sure that additional dynamic features will be added after our 1.0 release. If we are missing anything, especially derived from the usecases presented in this repo, I really want to hear about it!

lukasheinrich commented 8 years ago

yes scatter / gather (which I guess is almost synonymous with map/reduce) is one very common way this graph extensions work, but probably there are more, so we wanted to make this a first-class citizen using the notion of "workflow templates", and "workflow instances". In our JSON-based workflow schema, we allow for arbitrary sub-schemas (which need to be supported by the workflow engine that runs them). This allows custom contributions / workflow patterns to appear organically (maybe curated by a community)

Another question: can one run workflows using different docker containers (for each node in the graph) using CWL? If so, how do you describe the environment (which docker container, how to setup a shell environment within the container etc) and how do you coordinate a shared filesystem between those containers? In our case, we allow a list of resources to be listed (such as a network filesystem or a shared host directory), and docker containers can expert to see the work directory at a well-defined path (e.g. /workdir). see this example:

https://github.com/recast-hep/recast-cap-demo/blob/master/recastcap/capdata/yamlworkflow/ewk_analyses/ewkdilepton_analysis/postproc.yml

mr-c commented 8 years ago

I'm sure we'll add additional dynamic workflow patterns as the standard develops.

With CWL you can indeed define a different docker container to use for each tool or step: http://common-workflow-language.github.io/draft-3/CommandLineTool.html#DockerRequirement

We are also adding support for giving hints to the local system in the event you'd like to execute a CWL workflow using a traditional HPC cluster.

File staging is left as an implementation detail for CWL compliant platforms (some are shared filesystems, many are not).

More on the runtime environment: http://common-workflow-language.github.io/draft-3/CommandLineTool.html#Runtime_environment

Specific files to be used in computation are specified in the input object, a JSON formatted list of input parameters including file locations.

mr-c commented 8 years ago

@lukasheinrich Is there a link for the recast workflow spec? We maintain a (depressingly long) list of other scientific workflow systems at https://github.com/common-workflow-language/common-workflow-language/wiki/Existing-Workflow-systems and I'd like to add y'all.

lukasheinrich commented 8 years ago

we're working on a draft right now, hopefully there'll be something presentable soon.

everpub / openscienceprize

Workflow - Scientist #16