everpub / openscienceprize

:telescope: Everpub - Making reusability a first class citizen in the scientific workflow.

Meta-issue re composability #51

Open ctb opened 8 years ago

ctb commented 8 years ago

I noticed that #18 veered into some really great discussions of composability and I want to close that issue (because most of it has been dealt with by #41) but retain a link to composability.

So, put links to good comments about composability in this issue and we'll revisit if/when people want to talk about it more :).

lukasheinrich commented 8 years ago

I think we touched on this also a bit in https://github.com/betatim/openscienceprize/issues/16 . There is a little bit of tension between an efficient day-to-day workflow / research development and making it accessible / recomposable / remixable to a wider audience.

For example, I think most of the discussion in proposal.md has been around the idea that you have a single Docker image that has everything (e.g. all that library code @betatim mentioned in https://github.com/betatim/openscienceprize/issues/18) and a notebook frontend to tell the story. This makes it monolithic, not necessarily because of the use of the notebook format but rather because everything is packaged into a single image that represents the unique mix of tools for the paper / research project at hand.

So, Docker is great for packaging stuff but it won't absolve us form listing the dependencies in a human /machine readable form (something like pip freeze and pip install -r requirements.txt) Currently it's not easy to extract this information from a Dockerfile. Are there any existing tools that attempt to do this?

khinsen commented 8 years ago

I think this is the wrong approach. Docker is a deployment tool, period. You don't extract anything from a Dockerfile. You need some other build system, which could produce Dockerfiles among other outputs. Many Dockerfiles just start from Debian and install packages with apt-get, in that case the build system is Debian.

Another way to put this is that Docker is the code equivalent of PDF, not of LaTeX.

ctb commented 8 years ago

I think it would be straightforward to have the Dockerfile use the specfile to figure out that conda is what should be run. No reason to put the actual deps directly in the Dockerfile...
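
A rough sketch of what that could look like (the base image and the spec-file name environment.yml are placeholders, not anything agreed on):

```dockerfile
FROM continuumio/miniconda3

# The dependency list lives in a separate, human/machine-readable spec file;
# the Dockerfile only knows that conda should consume it.
COPY environment.yml /tmp/environment.yml
RUN conda env create -f /tmp/environment.yml
```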

lukasheinrich commented 8 years ago

@khinsen I don't disagree with that. I took the Dockerfile as an example because in practice it is often the one place where all the dependencies are spelled out (across all the kinds of dependencies you might have: Python deps via requirements.txt, system libraries via apt-get or yum, etc.). Not sure if people already keep such a comprehensive list elsewhere.

I would be very happy if there could be a way to specify those from another source. I agree that the Dockerfile (and the resulting image) should also be just one project output among others (I think the PDF/LaTeX analogy is apt). All that library code should be independently installable, with perhaps a container image being a reference implementation / installation.

cranmer commented 8 years ago

Isn’t the analogy:

dockerfile ~ .tex
docker build and base image ~ tex compilation engine
docker image ~ pdf

I agree with the point that you need apt-get etc. for it to run, but the Dockerfile is much more self-documenting than the resulting image.


betatim commented 8 years ago

Agreed on that.

I am not sure you can make something that is as flexible as a Dockerfile and yet easier for a computer to comprehend. Sometimes all you need is yum install long-list-of-packages or conda create <env-file>, but often you want to do both, then clone a specific tag of a git repository, patch the configure script, and then make && make install it.
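
Roughly the kind of Dockerfile that results (the package names, repository URL and tag below are made-up placeholders):

```dockerfile
FROM centos:7

# System-level dependencies via the distribution's package manager
RUN yum install -y gcc make git patch

# A local patch for the upstream configure script (placeholder file)
COPY fix-configure.patch /tmp/fix-configure.patch

# A specific tag of an upstream project, patched and built from source
RUN git clone https://github.com/example/somelib.git /src/somelib && \
    cd /src/somelib && \
    git checkout v1.2.3 && \
    patch -p1 < /tmp/fix-configure.patch && \
    ./configure && make && make install
```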

In my experience a Dockerfile is actually pretty darn good for working out how to set things up in your own environment when it differs from the one the Dockerfile is based on. For humans this is doable; for computers it is very hard (in general and given a Dockerfile).

Does someone know of successful cross-platform (as in gentoo vs redhat vs alpine vs centos vs suse) package managers? I can't think of any right now.

khinsen commented 8 years ago

@cranmer The difficulty with the analogy to TeX->PDF is that the latter is a single-layer operation, whereas code building and deployment has become such a mess that we use several layers.

Take a basic application written in Python. It would typically come with build instructions for distutils. Then it's integrated into Debian, creating a package specification and a package. And then someone builds a Docker image via a Dockerfile that starts from basic Debian and adds the package.

I'd argue that in this case TeX <-> Python, tex-command-line <-> setup.py, and PDF <-> installed-code. The Debian and Docker layers are just conversions to different build systems.
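
Spelled out for that hypothetical Python application (package and image names are placeholders):

```sh
# Layer 1: the upstream build system (distutils)
python setup.py install

# Layer 2: the distribution's packaging, which wraps layer 1
apt-get install python-someapp

# Layer 3: the deployment image, whose Dockerfile wraps layer 2
# (e.g. FROM debian + RUN apt-get install python-someapp)
docker build -t someapp .
```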

The real question here is: Which layer is the best notation to extract dependency information for other uses? Not the Dockerfile, in my opinion, because you can't be sure to find anything useful in there. The Dockerfile that says "FROM debian, ADD apt-get..." is of little help - you have to parse the Debian package spec after that. But the Debian spec is hardly better, it refers to yet another underlying build system. And both Docker and Debian use imperative specs, which are notoriously hard to analyze.

I suspect that it's best to introduce another notation for dependencies, rather than work around the problems with notations made for something else. Something simple and declarative. Perhaps there is something around that we can reuse, of course.

khinsen commented 8 years ago

@betatim No, but I don't think anyone has tried. Packaging is so difficult that doing it in a portable way is probably asking for too much.

khinsen commented 8 years ago

Anyone interested in the composability aspects should have a look at this writeup of a recent Twitter conversation.

lukasheinrich commented 8 years ago

I think the core issue is that we want projects to have multiple kinds of research output that live on a spectrum between these two ends:

1) 'deployed' data and code (e.g. docker images, the environment in which we can e.g. reproduce a given publication). Here all the dependencies etc. are explicitly satisfied.

2) software as a research product intended for re-use / composability, e.g. your classic library. Ideally we at least have a human/machine-readable list of dependencies / build requirements, but installation is the responsibility of the re-user / the package manager (and multiple pieces might have mutually exclusive deps).

There is valuable space in between those ends, in the sense that one can deploy 'black boxes' with well-defined input and output. Think: a library function with all necessary dependencies included.

... and now I see @khinsen link to a document that shows exactly this spectrum :+1:

I think all those things are not mutually exclusive. A good project should provide useful full deployments for reproducibility / limited re-usability, but also provide the possibility to install manually into a larger project.

khinsen commented 8 years ago

@lukasheinrich I completely agree with your description. And nothing is mutually exclusive, as you say, but the constraint of repurposing existing technology makes it hard to satisfy them all. In fact, my ActivePapers approach does everything you list, at the price of being incompatible with 99% of existing research software. But then, ActivePapers is research software for exploring these issues, not a tool made for widespread use in real life.

I am rather pessimistic about satisfying all criteria while being fully compatible with the past - I think this requires too much accidental complexity for anyone to handle. But I'd be happy to be proven wrong.

ctb commented 8 years ago

> I suspect that it's best to introduce another notation for dependencies, rather than work around the problems with notations made for something else. Something simple and declarative. Perhaps there is something around that we can reuse, of course.

Aieeeeeeeeeeeeeeeeee

I think we need some real, concrete use cases here to ground this discussion...

khinsen commented 8 years ago

@ctb I said "introduce", not "invent". Introduce something else in addition to the Dockerfiles used for building images.

lukasheinrich commented 8 years ago

There is some prior art on wrapping stuff around Dockerfiles / auto-generating Dockerfiles from a more machine-readable spec, e.g. https://www.packer.io/. I think @anaderi has some experience with it.

lukasheinrich commented 8 years ago

What I feel is currently missing from the Docker ecosystem is a clean way to compose layers from different pieces. Say you have a couple of layers that you know are compatible (in that they would not overwrite files within each other); you should be able to compose them as if you had just bind-mounted them during docker run:

    docker run -v /some/fileroot:/layermount1 -v /another/fileroot:/layermount2 <base image>

I think the ADD keyword works in some sense like that if it is followed by a tar archive (see this note in the Dockerfile reference):

> If <src> is a local tar archive in a recognized compression format (identity, gzip, bzip2 or xz) then it is unpacked as a directory. Resources from remote URLs are not decompressed. When a directory is copied or unpacked, it has the same behavior as tar -x: the result is the union of:

but it's not really a first-class citizen. If one wanted to merge two docker images one would need to export both to an archive and add them back via

docker export $(docker create image1) > archive1.tar
docker export $(docker create image2) > archive2.tar

and build a new image via a Dockerfile like this

FROM ??? (maybe common base of both image1 / image2)
ADD archive1.tar /
ADD archive2.tar /

and then hope for the best :-/. Maybe there is no better solution, but currently lots of people are writing very similar Dockerfiles, and it must be possible to streamline that process somehow.

The proposal did include a statement that we do want to be opinionated in certain ways. So maybe we can carve out an idea of what it means to be an "everpub"-compatible Docker image, which then has some guarantees (via conventions in how it is built) of being composable.
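
Purely as a hypothetical illustration (none of this is decided), such a convention could be as simple as requiring each image to keep everything it adds under its own prefix, so that two compliant images never overwrite each other's files:

```dockerfile
# Hypothetical "everpub" layering convention: everything this image adds
# lives under /everpub/<name>, so compliant layers can be merged or
# bind-mounted together without file collisions.
FROM debian:jessie
COPY mytool/ /everpub/mytool/
ENV PATH="/everpub/mytool/bin:${PATH}"
```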

lukasheinrich commented 8 years ago

Last thing I wanted to add. Within the high energy physics community there has been a quite successful effort to streamline software distribution, housed in a single global filesystem tree called CVMFS (https://cernvm.cern.ch/portal/filesystem), with an additional toolchain to easily set up the shell with software from /cvmfs. In ATLAS it's called ATLASLocalRootBase (not sure why; see https://twiki.atlas-canada.ca/bin/view/AtlasCanada/ATLASLocalRootBase), and I'm not sure what other LHC experiments use; maybe @betatim and @anaderi can comment on how this is done in LHCb. It allows me to, e.g., set up various software products (across a wide range, from compilers to very specific software used by a single experiment).

Some examples:

external software: lsetup gcc493_x86_64_slc6
HEP-specific software: lsetup "root-6.04.10-x86_64-slc6-gcc48"
ATLAS software: asetup AnalysisBase,2.3.33

Since this is a global filesystem, installation there happens through some political process and you need the entire thing plus a network connection. But maybe we can re-use some of the insights as a model for how to build ad-hoc coherent filesystems (which could then be docker import-ed from various, not necessarily coordinating, sources), i.e. a kind of "mini-cvmfs" where I can pick & choose which parts I want (maybe just a single compiler and a single ROOT version, instead of all the various options).

betatim commented 8 years ago

Provocative mode: if the two images would layer perfectly, couldn't we just cat a/Dockerfile > c/Dockerfile && tail -n +1 b/Dockerfile >> c/Dockerfile? Or maybe the person should have used a as a base for b, and we use b as a base for c?
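
Spelled out, the concatenation looks roughly like this (a sketch; tail -n +2 rather than -n +1 so that b's FROM line is skipped and a's base image is the one that counts):

```sh
# Assumes b/Dockerfile was written against the image built from a/Dockerfile.
cat a/Dockerfile > c/Dockerfile             # a's FROM line and build steps
tail -n +2 b/Dockerfile >> c/Dockerfile     # b's build steps, minus its FROM line
docker build -t c c/                        # and hope the steps don't clash
```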

My guess as to why there is so much duplication right now is that it is too easy to make a Dockerfile and there isn't (in HEP) an established body of high-quality containers with support and docs etc. (A bit like when there was a new Python web framework every week: it was too easy to make one; in the long run flask "won".) Right now it is easier for me to look at Dockerfiles you made, take the bits I want/need and put them in my personal and super special Dockerfile than to try and understand how to leverage yours. (Which is dumb, because in the long run I'd rather you maintained these images than have to do it myself, but short-sightedness is a thing.)


lukasheinrich commented 8 years ago

yeah maybe that's stretching it, and I can think of various things that can go wrong even if it worked from a filesystem-layering point of view (think conflicting ENV statements, etc.). As to the point that one should use the other as a base: I think this is exactly the core of the problem. With docker these two things don't commute (A based on B and B based on A result in different images), so this leads to the point where we would create long chains of image layers like

cern/slc6-base + grid middleware + ROOT installation + custom library + analysis code + ...

which takes a lot of discipline and upfront thought on how to layer everything together, if you want these to be re-used.

So I actually like this "(composable) mini-cvmfs" idea. The cvmfs maintainers seem to at least have a workable model, and maybe they have some insights on how to provision things like /usr/lib etc. so that the (perhaps conflicting?) requirements can be fulfilled.

khinsen commented 8 years ago

@lukasheinrich Considering at which level Docker containers operate (everything but the kernel), I'd be surprised if you could find two working Docker images that have no files in common. Every Docker image contains some Linux infrastructure stuff.

In my paper on ActivePapers, I use the concept of a "platform" able to accommodate "contents". The platform is the infrastructure you rely on; the contents are what you generate on top of it. Think of "MP3 player" vs. "MP3 files" as a simple example.

In the case of Docker, the platform provides the Linux kernel, and the contents are application software with all the Linux elements they require, except the kernel. This is inherently not composable. In fact, the real problem is that the traditional Unix model of software installation is not composable (see this blog post). One main motivation behind Docker was to work around the non-composability of Linux software installations. You cannot compose containers either, but at least you can run multiple containers on the same host system. In an OS with composable software installation (such as Nix/Guix), there is much less need for containers. They still have their place as sandboxes for secure execution, but that's a much less fundamental role.

khinsen commented 8 years ago

@betatim Your summary of Python Web frameworks is pretty much the standard story of innovation across human history. New technology starts out in a "high-temperature" state with lots of variation, which then "cools down" as consensus is reached on how to do things.

Since this is also the history of the universe, our solar system, and our planet, perhaps we should accept it as a law of nature and learn to live with it :smile:

In fact, I wonder if the specific problems of computing are perhaps the consequence of a dysfunction in this process. The complexity of dealing with compatibility issues forces consensus much too early, leading to de-facto standards that aren't really mature but that people prefer to live with. Another problem is that consensus is often reached through market dominance rather than by technical merit. That happens elsewhere as well, of course, but I suppose it's more pronounced in computing.

lukasheinrich commented 8 years ago

Before we start turning in circles, maybe we can propose a couple of options for how we can help with composability. I see two main areas:

1) have tools in the everpub toolchain that allow easy use / execution of multiple docker images / containers.

E.g. for my projects, I am expecting that I will be managing multiple docker images / containers that each encapsulate different parts of my work (stuff based on ATLAS software in one image, post-processing in an image using only ROOT + python).

so I think a good starting point is to assume that the everpub tools are executed on a machine that has the docker client installed and $DOCKER_HOST set (perhaps using a Carina setup).
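
A rough sketch of that assumed setup (the host address and image names are placeholders):

```sh
# Point the local docker client at a (possibly remote) docker daemon,
# e.g. one provisioned through Carina or docker-machine.
export DOCKER_HOST=tcp://some-docker-host:2376

# Run the different stages of the analysis, each in its own image.
docker run --rm analysis/atlas-stage ./run_selection.sh
docker run --rm analysis/root-python ./make_plots.py
```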

2) have tools in the everpub toolchain that allow an easy creation of good(TM)/best-practice Dockerfiles (or docker images directly)

I'm spending way too much time creating Dockerfiles / building images from them. Things like sharing provisioning scripts seem sensible (even if they won't work for everyone). There's no reason everyone needs to figure out how to build ROOT like this:

RUN git clone --quiet http://root.cern.ch/git/root.git /code/root-v6-02-12 &&\
    cd  /code/root-v6-02-12 &&\
    git checkout v6-02-12 &&\
    ./configure --all &&\
    make -j4 &&\
    make -j4 install &&\
    cd / &&\
    rm -rf /code

I do think that guix (as a build system within a docker image) and packer.io (as a tool to build docker images from a machine-readable spec) are very interesting options that go beyond Dockerfiles.

khinsen commented 8 years ago

@lukasheinrich Point 1) is clearly something we need to address. I don't see any particular difficulty either, but that may come :-(

As for point 2), I am not sure what problem you want to solve exactly. If you want to facilitate building ROOT, just put your image on Docker Hub, or bundle your Dockerfile with ROOT itself. I suppose you are suggesting that others could profit from your ROOT Dockerfile in doing something similar but different, but I fail to see how exactly that would work.

In any case, the problems of building and using Docker images are largely independent, so they can be attacked in parallel. And I strongly suspect that someone has already thought about best practices and tutorials for building Docker images for science.

khinsen commented 8 years ago

I just published a blog post about composition, with background information relevant to everpub.

lukasheinrich commented 8 years ago

Interesting read. Do you have any thoughts on IPFS as a solution for global content-addressable storage? I've been thinking recently that this could be a solution (certainly not by us, but by the community) to the commutation problem in docker images (A after B is different from B after A, even if they are compatible). They are teasing as much on their blog, but there has been no news on this since https://ipfs.io/blog/1-run-ipfs-on-docker/

khinsen commented 8 years ago

I have mentioned IPFS before in a blog post on data management, and I do see it as the right direction. Whether IPFS as an implementation will work out remains an open question; there is not enough experience with it yet. But the principles behind it are certainly very promising.

Do you know anyone who has actually used IPFS in real life?

lukasheinrich commented 8 years ago

Not personally. I myself only recently stumbled on it and think it's a promising effort that I want to keep an eye on.