Unconference tutorial: Escaping dependency hell -- Docker for reproducible research?

cboettig commented 9 years ago

Academic research depends on a software ecosystem of ever-increasing complexity. Moreover, each researcher's software environment is unique -- each of us makes use of different tools, different libraries, and different versions. These details are rarely fully documented, even by the researchers themselves. This poses a substantial barrier to reproducibility.

Docker provides a 'shipping container' to easily share your software environment with others. Unlike existing solutions, Docker isn't monolithic -- use the parts you like. This has made it very successful in the world of professional software developers because they, like researchers, have developed their own favorite tools and ways of doing things and don't want to change, but still need an easy way for others to run their software.

This tutorial would introduce Docker by illustrating 4 key concepts desirable in any approach to reproducible software environments:

A flexible approach: We don't want to make any assumptions about a user's preferred OS, text editor, etc. (Docker runtime)
An extensible approach: A user should be able to extend & repackage the environment with any of their favorite tools with minimal learning curve. (Docker containers)
A community approach: Common extensions of tools & combinations should be developed & maintained as a community base environment. This saves time and permits optimization without restricting flexibility of individual users. (Docker Hub)
A DevOps approach: Use scripts instead of manuals for installation. Scripts are human-readable, machine-readable, extensible, portable, & easily versioned. (Dockerfiles)

This would be a hands-on demo of running a 'Dockerized' environment, extending it, and committing & sharing those changes. (We would probably do this using RStudio, though I could also demonstrate this for ipython-notebooks or other computational environments.)

dmlond commented 9 years ago

I think this is a great idea for a session, and it completely hits on all the advantages of containers. We might want to talk a bit about creating small, modular application components, as well. I can certainly see the power of RStudio and ipython approaches to reproducible analysis, but there will be users who want to write their analysis pipelines in a Unix environment using simple command-line tools. There will also be researchers who have written a complex application (C, Perl, Python, Java, etc.) that they want others to use. We could demonstrate to these users how to create a very thin wrapper image containing just the base OS, required packages, interpreter/compiler, libraries, and the application itself. They can share that image with anyone, who can then run the application without all the fuss that you mention. I have some very simple docker images in both github and the docker registry that demonstrate a reproducible, reusable, and extensible next gen sequencing analysis based on bwa and samtools. We could then talk about how to extend one of these into an RDesktop or ipython extension image that allows these applications to be used inside them if you want.
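
To make the "thin wrapper image" idea concrete, here is a minimal sketch (not one of the images mentioned above) that renders the text of a small Dockerfile. The base image `debian:stable-slim`, the use of `apt-get` to install `bwa` and `samtools`, and the helper name `render_dockerfile` are all assumptions chosen for illustration.

```python
import json


def render_dockerfile(base_image, packages, entrypoint):
    """Return Dockerfile text for a thin wrapper image (hypothetical helper).

    base_image: base OS image (assumed to be Debian-based, so apt-get exists)
    packages:   distribution packages providing the tools and libraries
    entrypoint: command the container runs, as a list of strings
    """
    lines = [
        f"FROM {base_image}",
        # One RUN layer: refresh the package index, install, then clean up.
        "RUN apt-get update \\",
        f" && apt-get install -y --no-install-recommends {' '.join(packages)} \\",
        " && rm -rf /var/lib/apt/lists/*",
        # Exec-form ENTRYPOINT is written as a JSON array.
        f"ENTRYPOINT {json.dumps(entrypoint)}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Assumed example: wrap the bwa aligner, with samtools alongside it.
    print(render_dockerfile("debian:stable-slim", ["bwa", "samtools"], ["bwa"]), end="")
```

Saving the printed text as `Dockerfile` and running `docker build -t <name> .` in the same directory would build the image; anyone with that image could then run the tool without installing its dependencies themselves.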

cboettig commented 9 years ago

@dmlond Yup, I completely agree that we'd want to show a custom/complex application. After all, the use case is most compelling when working across different libraries where just using a language's built-in package manager won't cut it. That said, the nice thing about the web-based consoles like RStudio or ipython-notebook is that the audience may already be familiar with one of those interfaces but not know their way around a Unix command line. Might be good to do both?

I was thinking that it would be easiest to start off the tutorial just showing interactive use, and build up to writing a Dockerfile.

1) Install docker & run an existing container.
2) Install/modify stuff, commit changes.
3) Push changes to Docker Hub.
4) Write the Dockerfile corresponding to these changes. (Maybe also show putting it on Github and switching on 'automated builds' + repository links on DockerHub.)
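
As a rough sketch of what the steps above might look like at the command line, the following assembles the corresponding `docker` commands and, by default, only prints them. The image name `ubuntu`, the container name `demo`, the repository `yourname/demo-env:v1`, and the helper names are hypothetical placeholders, not the material actually planned for the tutorial.

```python
import shlex
import subprocess


def workflow_commands(base_image, container, target):
    """Build the docker CLI calls for the tutorial steps (hypothetical helper)."""
    return [
        # Steps 1-2: start an interactive shell in an existing image and
        # install or modify things by hand inside it.
        ["docker", "run", "-it", "--name", container, base_image, "bash"],
        # Step 2: save the modified container as a new image.
        ["docker", "commit", container, target],
        # Step 3: share the image on Docker Hub (needs a prior `docker login`).
        ["docker", "push", target],
        # Step 4: rebuild the same image from a Dockerfile in the current
        # directory instead of from manual changes.
        ["docker", "build", "-t", target, "."],
    ]


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print("$ " + shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_all(workflow_commands("ubuntu", "demo", "yourname/demo-env:v1"))
```

Keeping the dry run as the default means the script is safe to read through and run on a machine without Docker; passing `dry_run=False` would execute the commands for real.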

dmlond commented 9 years ago

I was definitely thinking 'do both'. I think containerized RDesktop and/or iPython Notebook systems will be how many researchers access and analyze their data. My examples would definitely be in the latter parts of the tutorial, as we move more into DevOps using the Dockerfile.

jennybc commented 9 years ago

I am very interested in having Docker explained to me. Sign me up!

hlapp commented 9 years ago

:+1:

Reproducible-Science-Curriculum / Reproducible-Science-Hackathon-Dec-08-2014
