everpub / openscienceprize

:telescope: Everpub - Making reusability a first class citizen in the scientific workflow.

Thoughts and questions on a first thorough review #18

Closed. ctb closed this issue 8 years ago.

ctb commented 8 years ago

For continuous integration, we need some indication of what success is to be built in. Is that "zero exit code" or can we put in assertions of some sort?

Konrad Hinsen clearly has some thoughts on composability

We shouldn't tie things to mounting local directories because that doesn't work with most docker-machine types (see my approach with data volumes: http://ivory.idyll.org/blog/2015-transcriptomes-with-docker.html). For a demo or prototype, of course it's ok :)

I really like this concept for some reason: "web based way to create an environment, try it and then download it".

Main reaction: we need to narrow down to some sort of hard focus for the OSP application, around which we build a fairy castle of air that spells out all the awesome things that could be done.

betatim commented 8 years ago

On Mon, Feb 22, 2016 at 9:28 PM C. Titus Brown notifications@github.com wrote:

For continuous integration, we need some indication of what success is to be built in. Is that "zero exit code" or can we put in assertions of some sort?

We should start with a script that runs paper.ipynb and tells you whether it was a success or not. Maybe it could also run other notebooks found in the top-level directory?
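
A minimal sketch of such a check script, assuming the standard `nbformat`/`nbconvert` Python APIs are available; the file name `paper.ipynb` follows the convention discussed here, everything else is illustrative:

```python
# run_paper.py -- minimal sketch: execute paper.ipynb and exit non-zero on failure,
# so a CI service can use the exit code as the success signal.
import sys

import nbformat
from nbconvert.preprocessors import ExecutePreprocessor


def run(path="paper.ipynb"):
    nb = nbformat.read(path, as_version=4)
    ep = ExecutePreprocessor(timeout=600, kernel_name="python3")
    try:
        # Any failing cell raises an exception here.
        ep.preprocess(nb, {"metadata": {"path": "."}})
    except Exception as err:
        print("FAILED: %s (%s)" % (path, err))
        return 1
    print("OK: %s ran without errors" % path)
    return 0


if __name__ == "__main__":
    sys.exit(run(*sys.argv[1:]))
```
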

Konrad Hinsen clearly has some thoughts on composability http://ivory.idyll.org/blog/2016-mybinder.html#comment-2520035392

I think a lot of people think "notebooks for everything" when you mention them. I think it should be more of a plumbing vs porcelain approach. The notebook explains the how, the why, and the story driving your analysis. Large parts of that analysis code will be in big libraries (http://root.cern.ch, numpy, and friends) or in 'library' code written for your analysis, but the story can be written down in a notebook with the few salient function calls.
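
To make the plumbing-vs-porcelain split concrete, a notebook cell might look something like the sketch below, where `analysis` is a hypothetical per-paper package holding the heavy lifting; all names here are made up for illustration:

```python
# Hypothetical notebook cell: the "porcelain" driving the analysis.
# All of the plumbing lives in the (made-up) per-paper package `analysis`.
from analysis import load_events, select_candidates, fit_mass_peak, plot_fit

events = load_events("data/2015_dimuon.root")       # heavy I/O hidden in the library
candidates = select_candidates(events, pt_min=20)   # selection cuts documented in the library
fit = fit_mass_peak(candidates)                      # the statistics live in the library too
plot_fit(fit)                                        # the figure that ends up in the paper
```
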

(There is also some work by guys from IBM to import notebooks into notebooks, not sure yet what I think of that)

We shouldn't tie things to mounting local directories because that doesn't work with most docker-machine types (see my approach with data volumes: http://ivory.idyll.org/blog/2015-transcriptomes-with-docker.html). For a demo or prototype, of course it's ok :)

Did we write "mount locally" somewhere? The only use case I can think of is when I run the analysis on my laptop/desktop. I'd like to mount the git repo into the container so I can edit it from the outside (with my favourite emacs).

I really like this concept for some reason: "web based way to create an environment, try it and then download it".

Main reaction: we need to narrow down to some sort of hard focus for the OSP application, around which we build a fairy castle of air that spells out all the awesome things that could be done.

:+1:

I see two things to focus on: educational material (how to make these executable papers) or building the infrastructure needed to host a 'journal' that shows them off. Despite thinking that the education part is the bigger problem, my naive guess would be that the infrastructure part is more grant-y.

ctb commented 8 years ago

On the last point: how about connecting with GigaScience or GitXiv? And we can write in connections to Software and Data Carpentry, although serious lesson development would be out of scope and budget for the first round, I think.

betatim commented 8 years ago

I have GigaScience and PeerJ (or PLOS??) on my list of publishers that should be up for this, but unfortunately I have no contacts there whatsoever. Do you know someone who could make an introduction?

I could potentially be introduced to someone from the arXiv, but they seem extremely busy and hence quite conservative towards new-new ideas.

Conclusion: focus on the web app/tools part, with lessons and education as nice-to-haves. (Please disagree if you do.)

ctb commented 8 years ago

I'm on the editorial board at GigaScience and PeerJ CS.

JackDapid commented 8 years ago

I know somebody at Springer Nature (I think they own GigaScience, at least last time I checked). I can try to contact them as soon as there is a first distributable version.

odewahn commented 8 years ago

O'Reilly also has some connections to PeerJ -- who would you be trying to reach there?

JackDapid commented 8 years ago

Another idea we are working on is a science hackathon: I would love to make it possible for papers written there to try the "dynamic & interactive" way. Very early proposal: https://docs.google.com/document/d/1HwiQxyVG1CnW6AUbFQ-0yMT-BYHi7MnVzXCKSId-xXg

betatim commented 8 years ago

@odewahn not sure who you'd want to contact. Educated guess: editors. They are scientists themselves, so they could be excited by the prospect; if they like the idea they can champion it within the publisher.

ctb commented 8 years ago

Oh, and we could talk to biorxiv, too. I don't think there's a shortage of publishers that would be interested.

But, this leads in another interesting direction - one of the big concerns I see from the perspective of publishers and librarians is that the technology and formats are changing very fast, so it's not at all clear that in (e.g.) 5 years we will be able to run today's Jupyter Notebooks inside of Docker. Perhaps part of our proposal could focus on doing something about that in the next year - it's probably too early to build standards, but defining the minimal ingredients could be useful at this point.

While I'm randomly brainstorming, any thoughts on bringing the R community's technology into the fold here? (That community is QUITE large in bio and biostats.) I have some experience with RStudio and RMarkdown, less with Shiny.

ctb commented 8 years ago

p.s. I can broker introductions with many journal editors. We should figure out what we want to say rather than worrying too much about who to say it to :)

ctb commented 8 years ago

(I'll summarize all of these at the end of this, but while I'm on a roll ;)

The integration of TravisCI and other continuous integration services with pull requests is particularly nice. One thing that I have yet to see is integration of continuous integration & pull requests on paper pipelines - this could be valuable for both collaboration and review.

betatim commented 8 years ago

Is it realistic to get one of the publishers to "endorse" this proposal on the time scale of Feb 28th? It would for sure make the proposal stronger. How quickly they could make a decision and public statement probably relates to what we ask from them. We should discuss this in #22, or, if we think that even for the minimal ask they won't be able to converge before we submit, I would punt this to after submission.

re: minimal ingredients, in my world we use a paper.md which contains markdown plus code blocks; that is the "paper". We will always be able to read that and rerun it. The tool I'd use to do so today is the Jupyter infrastructure of kernels. This exists as gistexec, interactive posts, and Rmarkdown. Not sure yet about Docker ... or how to replace it. However, I'm fairly confident that we will always be able to convert a Dockerfile to the-new-big-thing. And as lots and lots of people with a lot of cash use it, chances are someone will create the tool for us.
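
As a rough sketch of why "markdown plus code blocks" stays rerunnable: a few lines of Python are enough to pull the fenced code blocks out of a hypothetical paper.md and run them in order. Real tools (gistexec, the Jupyter kernel machinery) do this properly; this is only an illustration.

```python
# extract_and_run.py -- illustrative sketch: find the fenced python blocks in paper.md
# and execute them top to bottom in a shared namespace.
import re

with open("paper.md") as f:
    text = f.read()

fence = "`" * 3  # the triple-backtick fence, built here to keep this example readable
blocks = re.findall(fence + r"python\n(.*?)" + fence, text, flags=re.DOTALL)

namespace = {}
for i, block in enumerate(blocks, start=1):
    print("running code block %d of %d" % (i, len(blocks)))
    exec(compile(block, "<paper.md block %d>" % i, "exec"), namespace)
```
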

Having the flexibility of a paper.md or a paper.ipynb as the executable paper should make it easy to get the R crowd on our side. Paging Dr. @rgbkrk: have you tried feeding Rmarkdown to gistexec? Creating Rmarkdown from RStudio seems quite straightforward (never tried it myself, but I've watched people do it). Is this something people do? One exercise left for the reader would be to work out how to make RStudio run stuff in a docker container.

betatim commented 8 years ago

What is a paper pipeline for you @ctb? For me: a git repository that contains all the code required to produce a paper, as well as a Dockerfile to create the environment in which it runs. It contains a script or text file describing what commands to type in what order (ideally it would be a Makefile).

The workflow then goes something like this:

  1. Tim develops cool new colour scheme for plots
  2. (re)runs locally to check it works
  3. looks at locally made plots/figures/tables
  4. git commit
  5. create PR
  6. CI runs it and says "yes works"
  7. CI uploads plots or other "build artefacts" somewhere for later use/inspection

To share the latest PDF of the paper we point people at www.build-artefacts.com/betatim/icecream-prefs/latest where they get the latest output of the CI run.

rgbkrk commented 8 years ago

Gist exec handles R Markdown in the most basic of ways.

ctb commented 8 years ago

Agreed on def'n, provisionally :)

What about building a simple specfile that automates the bit of mybinder where you have to tell it whether to look at requirements.txt or a Dockerfile, and expanding to specify whether we should run RMarkdown, Jupyter, or blah, and make, snakemake, or pydoit?

More generally (and this is not well thought out) how about working towards a base Docker image that contains all the relevant software installs, and combining that with a specfile that says "here is what to run, here is our guesstimate of compute resources required, and here is where the interesting output will reside - data files, PDF, etc."?

And then implementing that?

--titus
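
A sketch of how such a specfile could be read and acted on, with an entirely made-up field layout (none of this is an agreed format; PyYAML is assumed for parsing, and the run commands are just standard jupyter/Rscript invocations):

```python
# everpub_spec.py -- sketch of reading a hypothetical specfile (format made up here).
# Fields follow the idea above: what to run, a resource guesstimate, where output lands.
import yaml  # PyYAML, assumed available

SPEC_EXAMPLE = """
runner: jupyter          # or: rmarkdown
build: make              # or: snakemake, pydoit
entrypoint: paper.ipynb
resources:
  cpus: 4
  memory_gb: 8
outputs:
  - figures/
  - paper.pdf
"""

spec = yaml.safe_load(SPEC_EXAMPLE)

RUN_COMMANDS = {
    "jupyter": ["jupyter", "nbconvert", "--to", "notebook", "--execute", spec["entrypoint"]],
    "rmarkdown": ["Rscript", "-e", "rmarkdown::render('%s')" % spec["entrypoint"]],
}

print("would run:", " ".join(RUN_COMMANDS[spec["runner"]]))
print("expected outputs:", spec["outputs"])
```
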

betatim commented 8 years ago

Crossing one thing off my list as done:

ctb commented 8 years ago

Follow-on to previous comment - this specfile could then be used in composition of workflows.

I think the idea of (specfile + demo implementation + exploring composition) could be a nice circumscribed proposal to the open science prize. Thoughts?

betatim commented 8 years ago

That gets quite close to http://bioboxes.org/, no?

Some thoughts on this in #16

ctb commented 8 years ago

Same idea as bioboxes, different intent and interface ;)

tritemio commented 8 years ago

Regarding composability of notebooks, it can be done with a "master" notebook (the main narrative) calling other notebooks, optionally passing parameters. There is a tiny function I wrote for the purpose:

https://github.com/tritemio/nbrun

and a more advanced implementation from @takluyver:

https://github.com/takluyver/nbparameterise

So the paper.ipynb can optionally be a master notebook executing other notebooks for the various macro-steps of the analysis. This more or less solves the dependency problem regarding the notebooks.
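
A rough illustration of that master/template pattern, written without relying on any particular helper (nbrun and nbparameterise wrap this more conveniently): inject a parameter cell into a template notebook and execute it in its own kernel. All file names and parameters below are made up.

```python
# master-notebook sketch: run a template notebook with injected parameters.
# nbrun / nbparameterise do this properly; this only illustrates the pattern.
import nbformat
from nbconvert.preprocessors import ExecutePreprocessor


def run_template(template_path, out_path, params):
    nb = nbformat.read(template_path, as_version=4)
    # Prepend a code cell that defines the parameters the template expects.
    assignments = "\n".join("%s = %r" % (k, v) for k, v in params.items())
    nb.cells.insert(0, nbformat.v4.new_code_cell(assignments))
    ExecutePreprocessor(timeout=3600).preprocess(nb, {"metadata": {"path": "."}})
    nbformat.write(nb, out_path)  # executed copy, with outputs, for later inspection


# Hypothetical macro-steps of the analysis:
for sample in ["2014", "2015"]:
    run_template("template_fit.ipynb", "out_fit_%s.ipynb" % sample, {"sample": sample})
```
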

For software dependencies, I think specifications of "conda environments" (including versions of each package) can help to rebuild the "software environment" in the years to come (assuming Continuum does not delete old packages from their archives, but this is unlikely). Conda covers both Python and R packages and other basic libraries. Also, environment specifications are purely declarative YAML files (as @ctb suggested). I think using a conda environment inside Docker would be a great solution.

khinsen commented 8 years ago

Some thoughts concerning composability, which is actually the core issue of this project.

There are three points of view concerning composition: science, communication, and technology.

In terms of science, an executable paper is composed of ingredients such as models, methods, experimental data, fitted parameters, etc. The details very much depend on the kind of science one is doing. Reusability requires that each ingredient can be replaced by a different one easily.

In terms of communication, an executable paper is composed of new material and prior art, to which the new material refers.

In terms of technology, we have to deal with the huge mess that we have piled up over a few decades. A ready-to-execute paper is composed of an operating system, compilers, linkers, interpreters, containers, servers, databases, individual datasets, libraries, middleware, software source code, and of course explanations for human readers. Maybe I have forgotten something.

The challenge is to align these different points of view in order to get something useable. We need to compose technological artefacts in such a way that we can communicate the science in a way that is understandable and reusable. That is in my opinion the ultimate goal of this project.

Ideally, we would have a single kind of technological artefact that is inherently composable. Procedures in a programming language are such artefacts: we can make a procedure that calls a few already existing procedures. Dynamic libraries are also composable: we can make a dynamic library that calls code from a few other dynamic libraries. Binary executables are composable with more effort: we need to write glue code in order to produce a binary executable that calls other binary executables. To compose different kinds of artefacts into a whole, we have to do messy interfacing work. Most of the hard problems in computing are related to composing artefacts that were not designed for being composed: packaging, portability, deployment, dependency hell, DLL hell, software rot, and many more. Composition is the #1 source of accidental complexity.

Now let's look at the technologies mentioned here, from the point of view of composability.

As far as I know, Docker containers are not composable, though I may be wrong. It doesn't sound impossible in principle to make a container out of three existing containers, but I haven't seen it done. If containers are not composable, there can only be one container in an executable paper.

BTW, there is an alternative approach that is composable: packages as defined by Nix or Guix (two implementations of the same concept). Much more promising than containers, in my opinion. Also less popular, because less convenient for software deployment. But our problem is different from software deployment.

Notebooks are not composable. You cannot combine two notebooks into a larger notebook, nor into any other useful entity. More importantly, you cannot call code in one notebook from another notebook. That means that notebooks are not reusable either. At best, reuse means that only a small part of a big notebook must be modified in order to do a different computation.

Mybinder or Everware compose an environment implemented as a container with a collection of independent notebooks into a publishable package. That package is not composable with anything else. On the other hand, this composition aligns very well with the communication aspect: the environment contains the prior art, and the notebooks contain the new stuff. Moreover, it's acceptable that the prior art is not so explorable by the user, as it has presumably been published and explained before.

That leaves the question of how to package the "new stuff" in such a way that its individual scientific components are (1) reusable and (2) explained to human readers. Software libraries offer (1) but not (2), and are restricted to code. Notebooks offer (2) but not (1). They can contain code and small datasets. Independent datasets would be a straightforward addition, so data isn't really the problem.

Traditional literate programming, as introduced by Knuth, looks like a promising way to integrate code with a human-readable explanation of the science, in a composable way. Unfortunately, it doesn't compose with notebooks into a coherent human-readable document.

In summary, what this project really is about is to compose different technologies in such a way that they permit the construction of executable papers by composition of reusable components.

khinsen commented 8 years ago

@tritemio Nbrun looks interesting. Can you compose notebooks recursively using this technique? In other words, can you treat a notebook like a procedure that can call other procedures?

tritemio commented 8 years ago

@khinsen we were probably writing the comment at the same time. I agree with your analysis. For me conda covers most use cases. What's your take on that?

Also, a simple form of notebook composition is possible with the concept of a "master" notebook and "template" notebooks (see the nbrun link) that act like functions. It is not as flexible and general as calling a real function, but for the macro-steps of the analysis, with few parameters, it works fairly well (and you have links to go back and forth between master and template notebooks if you want to dive into the details).

As an example, I recently used the following pipeline:

  1. A number of template notebooks: these accept input parameters, do normal analysis/plots and save the important results in CSV.
  2. A single "master" notebook executes all the template notebooks with all the input parameters that are necessary
  3. A "summary" notebook loads and plot/represents the results.

Notebooks are inter-linked for easy navigation.

@khinsen, to answer your last question: yes, this procedure can be repeated (a template notebook can call other notebooks, with or without parameters).

tritemio commented 8 years ago

@khinsen In principle you can build a complex dependency "graph" but when you use multiple layers you cannot easily "see" the full dependency graph looking only at the master notebook (like when you call a function you don't know how many subfunctions are also called).

khinsen commented 8 years ago

@tritemio Conda is fine for what it contains. For many Python-based projects it's probably good enough. But if you don't use Python, or if you need to compile your own extension modules, then conda starts to be as much of a problem as it is a help. In particular on MacOSX, where you need a very peculiar Apple SDK installation if you want to link to libraries supplied by conda.

Euhh... I just noticed that you wrote "conda" but not "anaconda". Conda on its own is just a build and deployment tool. I wouldn't want to package all my software from scratch using conda!

betatim commented 8 years ago

On Tue, Feb 23, 2016 at 7:54 PM Konrad Hinsen notifications@github.com wrote:

As far as I know, Docker containers are not composable, though I may be wrong. It doesn't sound impossible in principle to make a container out of three existing containers, but I haven't seen it done. If containers are not composable, there can only be one container in an executable paper.

I think the best you can do is mount a container inside another. For this to work the containers probably would have to have been designed to be used together like this. Then there is bioboxes where you treat each container as a blackbox. I am not sure I like this approach.

BTW, there is an alternative approach that is composable: packages as defined by Nix http://nixos.org/ or Guix http://www.gnu.org/software/guix/ (two implementations of the same concept). Much more promising than containers, in my opinion. Also less popular, because less convenient for software deployment. But our problem is different from software deployment.

I am not so worried about the fact that I can not automatically merge the environments of two separate executable papers. I would posit that a successful automatic merge is only possible in a small fraction of cases. In the majority you would need a human to decide how to resolve conflicting versions of the same package or their dependencies.

So the fact that you have to have a human read each Dockerfile, think about it and create (by hand) a third one that is the merger is not a big practical downside.

Notebooks are not composable. You cannot combine two notebooks into a larger notebook, nor into any other useful entity. More importantly, you cannot call code in one notebook from another notebook. That means that notebooks are not reusable either. At best, reuse means that only a small part of a big notebook must be modified in order to do a different computation.

In addition to nbrun there is work going on by guys from IBM https://github.com/jupyter-incubator/contentmanagement

Mybinder or Everware compose an environment implemented as a container with a collection of independent notebook into a publishable package. That package is not composable with anything else. On the other hand, this composition aligns very well with the communication aspect: the environment contains the prior art, and the notebooks contain the new stuff. Moreover, it's acceptable that the prior art is not so explorable by the user, as it has presumably been published and explained before.

It is somewhat composable: you can treat it as a black box which has zero inputs and produces some output. That is a little better than nothing at all, but not much.

That leaves the question of how to package the "new stuff" in such a way that its individual scientific components are (1) reusable and (2) explained to human readers. Software libraries offer (1) but not (2), and are restricted to code. Notebooks offer (2) but not (1). They can contain code and small datasets. Independent datasets would be a straightforward addition, so data isn't really the problem.

I disagree: libraries are (1) and (2), if the maintainers bother to write the documentation. (I use "library" to mean a contained bit of code that lots of people use, like scikit-learn, glibc, ROOT, and so on; not a shared library.so, which no one should use unless they have the source.)

(Below I use notebook as a place holder for any narrative+code document, ipynb, Rmarkdown, ...)

In my experience only those with a wish for insanity create notebooks longer than a few hundred lines of code. It quickly becomes unwieldy. The builtin editor is not up to scratch compared to emacs/vim/atom. What ends up happening is that people explore ideas using a notebook and then create a plain .py or .R file which contains the end result of the exploration. Over time all of that code forms a library for this paper. There are often only one or two notebooks that then use this code. The library contains all the plumbing and the notebook drives it. It connects the high-level commands with narrative and displays the results of the research. Maybe it does some small calculations right then and there.

So I think of the paper.{md,ipynb} as the cockpit from which you control the analysis and have shiny instruments informing you about the state of the plane. If you want to know how the fuel gauge works, you take off the panel and follow the cables down. Just like you follow a function call or shell script invocation to find out what it really does.

I think finding the right balance for a complicated problem like this will only be possible by proposing a solution, building it, using it, finding out why it sucks, and starting a new one. Then iterating a few times. In the spirit of "most code is written so it can be deleted" ;)

I can think of several use cases from the LHC which won't work with what is proposed here. For example, you could ask: is a paper truly reusable if I don't also provide you with the several CPU-years it takes to actually run it from start to finish? What if the data is so large that it can only be accessed from machines (close to the data) to which only CERN users have access? For a first attempt at building something like this we should not get distracted by the reasons why it will never work, but focus on the reasons why it will work.

khinsen commented 8 years ago

@tritemio I am having second thoughts about Nbrun. You use the terms "template" and "macro", so I wonder if nbrun runs sub-notebooks in a separate namespace. If not, then that's not proper composition because there is no well-defined interface between the components. A dangerous source of bugs.

@betatim I fully agree that only experience will tell what works and what doesn't. But it does help to do some brainstorming about possible difficulties in advance.

The only point on which I disagree with what you say is that a documented library is good enough as an explanation of a new model or method in an executable paper. Library documentation is reference style, organized around the code. It explains how the code does something, but it doesn't explain the motivations for doing things, nor the concepts required for understanding new science. You could of course add such material to a library documentation, but that's not where it belongs. It belongs into a narrative specifically written for explaining things. That was Knuth's idea with literate programming.

A traditional paper has a section "materials and methods" and a section "results". They belong together and reference each other. It's no good to have "materials and methods" in library documentation and "results" in notebooks. That's a bit like a traditional paper saying that "a description of the methods is available from the authors upon request". A barrier between methods and results that prevents understanding.

betatim commented 8 years ago

On Wed, Feb 24, 2016 at 9:32 AM Konrad Hinsen notifications@github.com wrote:

@betatim https://github.com/betatim I fully agree that only experience will tell what works and what doesn't. But it does help to do some brainstorming about possible difficulties in advance.

Many :+1: on this point. You need feedback loops everywhere, and I think we are doing an OK job here attracting brains to give feedback and then discuss! I felt like pointing out that we should not fall into the trap of "oh this will never work", because I see so many good ideas derailed by that. It's just too easy to think of reasons why something won't work ;)

The only point on which I disagree with what you say is that a documented library is good enough as an explanation of a new model or method in an executable paper. Library documentation is reference style, organized around the code. It explains how the code does something, but it doesn't explain the motivations for doing things, nor the concepts required for understanding new science. You could of course add such material to a library documentation, but that's not where it belongs. It belongs into a narrative specifically written for explaining things. That was Knuth's idea with literate programming.

I need to refresh my memory of Knuth's literate programming a bit. Right now I am undecided on whether having docs/moduleA.md contain the narrative documentation for code/moduleA.py is good or bad, or if it would be better to have literate/moduleA.lit from which we generate the code and the docs. IMHO the big challenge is to get authors to write any kind of narrative docs. Maybe my expectations are too low, so that I am happy with any form of narrative docs. (The API docs belong in the code, and then we generate a nice HTML/PDF/... from them.)

A traditional paper has a section "materials and methods" and a section "results". They belong together and reference each other. It's no good to have "materials and methods" in library documentation and "results" in notebooks. That's a bit like a traditional paper saying that "a description of the methods is available from the authors upon request". A barrier between methods and results that prevents understanding.

Yes. Though we could envision having a paper that says "a description of the methods/code is available in this hyperlinked sub-document", which is delivered with the paper because it is in the same repository. Just a different document.

khinsen commented 8 years ago

@betatim The question of how to divide the information into files should probably be left to experimentation, and even remain flexible in the long run, to accommodate a maximum of tools and habits. There are various literate programming tools out there, but there are also people who prefer code and comments in separate files. What matters to me is that our tools should not discourage us from writing good explanations - as you say, the hard part is convincing people to actually do it.

tritemio commented 8 years ago

On Wed, Feb 24, 2016 at 12:32 AM, Konrad Hinsen notifications@github.com wrote:

@tritemio https://github.com/tritemio I am having second thoughts about Nbrun. You use the terms "template" and "macro", so I wonder if nbrun runs sub-notebooks in a separate namespace. If not, then that's not proper composition because there is no well-defined interface between the components. A dangerous source of bugs.

The sub-notebook is always executed by a new IPython kernel, so it's a different process; there is no namespace sharing. You can pass arguments that are serializable, and results are written down in output files. What is not formally defined is the "notebook signature". You need to open the sub-notebook to learn which arguments you can pass. There is no introspection and there is no error checking that you are passing arguments with the right names. These checks are implemented by @takluyver's nbparameterise, so it is technically possible. Similarly, you have to look at the notebook to learn what results it saves.

Notebooks will never be (at least not easily) as composable as functions. But as an outer layer composition of macro-steps (and by macro I mean "big", high-level) they will work well IMHO.

We should use/promote the right abstractions, and at this point I would not encourage this type of notebook composition beyond 1 or 2 layers of notebook calls (i.e. a notebook which calls a notebook which calls a notebook).

betatim commented 8 years ago

http://cdn.emgn.com/wp-content/uploads/2015/07/Inception-Facts-EMGN3.gif

notebooks in notebooks in notebooks in notebooks

m3gan0 commented 8 years ago

Digital Science is another group you could try reaching out to. They run Figshare, Overleaf, and LabGuru - all devoted to opening up science workflows and outputs in various forms.

ctb commented 8 years ago

@m3gan0 good idea!

betatim commented 8 years ago

Do we know anyone there? If yes @m3gan0 could you post in #22 ?

ctb commented 8 years ago

I know people there, but I don't think we should ask them for an expression of interest at this point - just mention them as part of the ecosystem we hope to work with. protocols.io is another one (we could probably get an expression of interest from Lenny Teytelman quite quickly, actually).