A research compendium is a way of providing the wider community with the data, methods, and analysis associated with a published paper. Ideally, anyone interested in your paper should be able to view the raw data and recreate the analysis that was used. A compendium should have a clear file organization, a clean separation of components/steps (data, analysis, outputs), and a specification of the computing environment that was used.
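To make that concrete, here is a minimal sketch of the kind of layout and environment record I have in mind. The directory names and the environment.txt file are hypothetical choices for illustration, not a standard; a real compendium would typically pin its full software stack with something like an environment.yml or a Dockerfile.

```python
# Hypothetical compendium skeleton: raw data, analysis code, and generated
# outputs live in separate directories, and the software environment is
# recorded so the analysis can be re-run later.
from pathlib import Path
import sys

root = Path("example-compendium")
for sub in ["data/raw", "analysis", "outputs/figures", "outputs/tables"]:
    (root / sub).mkdir(parents=True, exist_ok=True)

# Stand-in for a fuller environment spec (conda environment.yml, Dockerfile, ...)
(root / "environment.txt").write_text(f"python=={sys.version.split()[0]}\n")
(root / "README.md").write_text("Explain here how to re-run the analysis.\n")
```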
I was the second author on this paper early in my graduate school career (https://www.nature.com/articles/nprot.2017.128). The computational aspects of the paper include an image segmentation pipeline (for extracting data from live cell microscopy images) and an analysis pipeline. Since it is a protocol paper, the goal was to make the tools accessible to anyone who wants to use them. The code for the two pipelines is available in two GitHub repos and as Docker containers. One of the repos includes two Jupyter notebooks that walk through example data and show how to use the basic functions of the pipelines.
Ultimately, I am unaware of anyone actually using these repos outside of our lab. I think there are a few main challenges. First, the API and overall structure of the two repos are not well documented, so they would be difficult to figure out without someone from our lab explaining them to you. The Jupyter notebooks help, but they are not detailed enough to support new projects. Second, the code is split into two repositories, which again would likely require some explanation of which repository is used for what. The last challenge was not as big of an issue years ago: the repositories are written in Python 2 and are well out of date with commonly used packages (e.g., numpy, scipy). Because of this, it would be difficult for anyone else to incorporate the functionality of the pipelines into other code they would like to use.
This is actually why I'm in the middle of writing a new pipeline, which I'm hoping will be modular, stand-alone, and well documented (as well as faster and easier to use). I want this code to be useful to others, and a big reason I'm taking this class is to learn the best way to do that.
Very cool @sjeknic! I am trying to do the same for some of my code that is scattered among various repos. For example, you'd currently need 4 repos to go from start to finish, but it would be nice to have one repo to run them all.