everpub / openscienceprize

:telescope: Everpub - Making reusability a first class citizen in the scientific workflow.

lobbying for official / supported docker images from scientific software projects #111

Open lukasheinrich opened 8 years ago

lukasheinrich commented 8 years ago

I talked about this a bit with @cranmer et al and maybe this is a good forum, also this is related to #51 .

A lot of software products are already a good fit for the Docker paradigm of wrapping a single entry-point / program / command line tool, with all their dependencies. I think lobbying large, widely used software projects (as opposed to e.g. library providers whose code is meant primarily for re-mixing) to build official docker images can help both by 1) giving at the very least a reference Dockerfile showing how the authors of that project would install their own software and 2) giving already a useful

A perfect example in HEP would be ROOT. I think a ROOT docker base image would already go a long way for a lot of the scientific code that exclusively lives in the ROOT ecosystem.

Other HEP examples are Monte Carlo generators. These are also almost exclusively (at least by experimenters) used as black boxes that eat a couple of configuration files and spit out events in some format. Maybe another example could be GEANT? Maybe there are similar examples in biomed fields?

should we approach such projects and try to get them to have official docker images?

lukasheinrich commented 8 years ago

more concretely, for everpub, I'm thinking maybe we can have a collection of good base images to build from, some of which may be these official base images.

somewhat similar to how e.g. heroku / travis provide some base environment (either by autodetecting in the former case, or via the language specification in .travis.yml in the latter)

betatim commented 8 years ago

Approaching projects which don't already want to provide these images is tricky. Packaging is a lot of dirty and boring work, so if you approach a project their default answer will most likely be "if you think this is such a great idea we welcome PRs". Which isn't so surprising. I think it would be super if this existed (but not so super that I will work on it).

For everpub one of the things that will stop people from trying it out is having to create their own Dockerfile from scratch so we won't get around having to make (and maintain!) some containers ourselves.

khinsen commented 8 years ago

I'd say the best moment to approach others asking them to make official Dockerfiles is when we have a prototype to show. Then we can ask "Do you want this for your users?" and we are in a much stronger position.

lukasheinrich commented 8 years ago

:+1:

so keeping this in mind, I would like it to be possible for images to be based on any base-image. This has been somewhat tricky with both everware and binder in that the only way to get code in there is to build docker images upon their respective bases.

cranmer commented 8 years ago

:+1: :+1: on this. I agree it would be nice if we could invert the layers for binder and everware. Currently you extend the binder base image with your stuff instead of binder adding what it needs to a random base image. Not sure how hard that is.

On Mar 1, 2016, at 6:52 PM, Lukas notifications@github.com wrote:

so keeping this in mind, I would like it to be possible for images to be based on any base-image. This has been somewhat tricky with both everware and binder in that the only way to get code in there is to build docker images upon their respective bases.


lukasheinrich commented 8 years ago

I think it should be reasonably easy. In any case, for any everpub-compatible docker image, the assumption is that whatever everpub adds must be compatible with whatever the publication / project-specific code is doing. So I think inverting the order is technically not a problem.
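A minimal sketch of what "inverting the layers" could look like: start from an arbitrary base image and append the everpub layer last. The function name is made up for illustration, and `pip install everpub` is only the hypothetical install command floated in this thread, not an existing package.

```python
def everpub_dockerfile(base_image, install_cmd="pip install everpub"):
    """Return Dockerfile text that adds a (hypothetical) everpub layer on top of any base image."""
    if not base_image or any(ch.isspace() for ch in base_image):
        raise ValueError("base_image must be a non-empty image reference without whitespace")
    lines = [
        # The project's own image comes first, whatever it is based on.
        f"FROM {base_image}",
        "# everpub is added last, on top of the project-specific software",
        # Assumption: the base image already ships the tool used by install_cmd (here pip).
        f"RUN {install_cmd}",
    ]
    return "\n".join(lines) + "\n"
```

For example, `everpub_dockerfile("thirdparty/container:latest")` yields a three-line Dockerfile whose first line is `FROM thirdparty/container:latest`. This only works if the base image can actually run the install command, which is exactly the compatibility assumption mentioned above.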

khinsen commented 8 years ago

I don't know why binder and everware insist on "being first", but I wonder if the best way out is to allow multiple Docker images (see #51), of which one is reserved for the everpub infrastructure.

lukasheinrich commented 8 years ago

yes i'm thinking something similar. in my mind the best point of view is that an everpub "instance" will work as a mini-cluster / swarm

in short I'd like to be able to call (let's say from the notebook instance in the master container) something like:

import everpub.dockerapi
everpub.dockerapi.run(cmd = './make/some/plots.py /shared_workdir/out.png', container = 'thirdparty/container:latest')
import everpub.display
everpub.display.Image('/shared_workdir/out.png')

where the containers can share some state using shared volume-binds. Docker already has a functional Python API, but maybe everpub can provide a nicer layer on top that takes care of binding a common filesystem etc.
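A rough sketch of how such a layer might bind a common filesystem, assuming it simply shells out to the `docker` command line client rather than using the Python API. The names `docker_run_args` and `run` and the `/shared_workdir` default are illustrative only; nothing here is an existing everpub interface.

```python
import shlex
import subprocess


def docker_run_args(cmd, container, host_dir, mount_point="/shared_workdir"):
    """Build a `docker run` argument list that bind-mounts host_dir at mount_point."""
    return [
        "docker", "run", "--rm",             # remove the container once the command exits
        "-v", f"{host_dir}:{mount_point}",   # host_dir should be an absolute path for a bind mount
        container,
    ] + shlex.split(cmd)                     # split the command string the way a shell would


def run(cmd, container, host_dir, mount_point="/shared_workdir"):
    """Run cmd in the given container; raises CalledProcessError if it exits non-zero."""
    return subprocess.run(docker_run_args(cmd, container, host_dir, mount_point), check=True)
```

Every container started this way sees the same host directory at the same mount point, so a plot written by one container can be picked up by the notebook container afterwards.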

@betatim, thoughts?

betatim commented 8 years ago

I like the idea of inverting the layers. Adding the everpub stuff at the end. For everware we provide a base image a) as a form of "specification" and b) to get people started. There is no need for a container to inherit as long as it has what everware expects in the right locations. There are some containers from the REP guys that have no everware heritage but work.
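To illustrate the "specification instead of inheritance" idea, here is a small sketch that checks whether a container filesystem has the expected files in the expected places. The paths in `REQUIRED_PATHS` are invented for the example; neither everware nor everpub is known to require them.

```python
from pathlib import Path

# Hypothetical spec: locations (relative to the container root) that must exist.
REQUIRED_PATHS = ("usr/local/bin/everpub", "etc/everpub/config.yml")


def missing_requirements(root, required=REQUIRED_PATHS):
    """Return the required paths, relative to root, that do not exist."""
    root = Path(root)
    return [rel for rel in required if not (root / rel).exists()]
```

A container passes the check when the returned list is empty, no matter which base image it was built from.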

I would first build a system that is much simpler. Only one container, that is it. We can always provide a DOCKER_HOST inside that container if the user feels the need to launch (a few) extra containers.
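As a small illustration of the DOCKER_HOST idea: Docker clients read the `DOCKER_HOST` environment variable and otherwise fall back to the local Unix socket, so code inside the main container could find its daemon like this. The helper name is made up for the sketch.

```python
import os

# Default endpoint used by Docker clients on Linux when DOCKER_HOST is not set.
DEFAULT_DOCKER_SOCKET = "unix:///var/run/docker.sock"


def docker_endpoint(environ=None):
    """Return DOCKER_HOST if it is set and non-empty, else the default local socket."""
    environ = os.environ if environ is None else environ
    return environ.get("DOCKER_HOST") or DEFAULT_DOCKER_SOCKET
```

If the platform injects something like `DOCKER_HOST=tcp://10.0.0.5:2375` into the main container, extra containers launched from it are started by that daemon, i.e. as siblings rather than nested containers.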

Why start simple? Most work can be done on a single host (given enough CPUs and RAM). Most people do most of the interesting science on their laptop/desktop. There is large scale processing but I will boldly claim that it is usually pretty dull in terms of actual science. To use LHC lingo: generating Monte Carlo simulations, processing data or simulation into ntuples, etc. is large scale, requires huge amounts of CPU time and disk, but is also mostly uncontroversial/standard by virtue of being done so often at such a scale. Because it is so large scale, though, it is incredibly hard to share: if you don't have access to the LHC grid you just aren't going to be able to do it, and if you do have access you can use the collaboration-specific tools.

You need to keep track of these parts of an analysis for sure. However I think we can add more value by focussing on the later stages.

Usually the data volumes are much, much smaller, require less CPU and are far less standardised. A lot more choices are made and a lot more "science" happens. I like to focus on this part as it is more bang for our buck. It is a simpler problem (from a computing POV) that has more impact on science (because more choices are made).

(apparently adding horizontal lines confuses the markdown/email parser that thinks it indicates that the signature starts)

On Wed, Mar 2, 2016 at 6:32 PM Lukas notifications@github.com wrote:

yes i'm thinking something similar. in my mind the best point of view is that an everpub "instance" will work as a mini-cluster / swarm

  • 1 everpub docker "head/master" container
    • contains possibly the frontend to display prose / research result objects (plots, tables, whatever, possibly in notebook format)
    • if possible could also contain code / software, but not necessarily
    • contains bindings from the frontend that allow calling other containers
  • zero or more other containers
    • can be third party containers
    • do not need any additional infrastructure (though maybe for rich interaction we can think about providing a layer that can optionally be installed)


ctb commented 8 years ago

+1. If we get awarded the prize, we should focus on the smaller compute things first, because I think that will be of more value to biomed people. I can justify that more if/when the time comes :)

lukasheinrich commented 8 years ago

:+1: I agree as well i.e.

1) most stuff should be doable within a single instance
2) inverting the layers / just having a spec of what everpub needs and where (possibly with an installation script or something like 'pip install everpub')
3) providing a DOCKER_HOST to the main instance if necessary, so that it can launch sibling containers
4) not sure what others feel, but I would argue that an everpub API layer on top of docker would be helpful

I'm fine if we start with 1) and 2). I might have a prototype example notebook of 3) and 4) based on our workflow stuff (which does indeed run in this master/sibling setup on a docker cluster) in the next days.

the reason why I want to emphasize this is that I have our current analysis workflows in mind, where for one stage we are bound to the ATLAS software releases (the 'dull' stuff @betatim mentions, event selection / reduction etc.), but later on, the more hands-on analysis / result presentation stuff lives in a completely different environment. So I want to be able to run the dull stuff, but not chain myself to the software choices of that environment for my later stuff.

anaderi commented 8 years ago

I guess it is not a matter of the order of inheritance, but a matter of the entrypoints of an [analysis] container, which should somehow be agreed upon so that the analysis can be run in different environments:

each of those environments requires a different entrypoint: Makefile, jupyter notebook, jupyterhub, .travis.ci + test_scripts

in the case of REP we did it simply with a script that launches different stuff depending on environment variables. I'm not sure what would be the best way to generalize this approach without much risk of being cut by Occam's weapon. Possibly we could identify

so when the analysis is started within a certain environment (given those environments are not difficult to discriminate), the corresponding entrypoint becomes available, say, via an everpub command-line utility.
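A minimal sketch of such an environment-dependent entrypoint, in the spirit of the REP script described above but not taken from it. The `EVERPUB_ENV` variable, its values and the `./test_scripts/run_all.sh` path are all assumptions made up for the example.

```python
# Hypothetical mapping from an EVERPUB_ENV value to the command to launch.
ENTRYPOINTS = {
    "make": ["make"],
    "notebook": ["jupyter", "notebook", "--ip=0.0.0.0", "--no-browser"],
    "jupyterhub": ["jupyterhub-singleuser"],
    "ci": ["./test_scripts/run_all.sh"],  # placeholder for the project's test scripts
}


def select_entrypoint(environ, default="make"):
    """Pick the command for the current environment from the EVERPUB_ENV variable."""
    mode = environ.get("EVERPUB_ENV", default)
    try:
        return ENTRYPOINTS[mode]
    except KeyError:
        raise ValueError(
            f"unknown EVERPUB_ENV {mode!r}; expected one of {sorted(ENTRYPOINTS)}"
        ) from None
```

A container entrypoint could then call `select_entrypoint(os.environ)` and hand the result to `os.execvp(command[0], command)`, so that the same image starts a notebook in one environment and runs the tests in another.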

Does it make sense?