Choose a reproducibility platform/workflow

brainwane commented 6 years ago

Right now, to store and share replications and reproductions, we use a special-purpose repository called "REMARK: Replications and Explorations Made using the ARK". We're investigating whether we should stay with that or choose a different approach.

@llorracc has asked our NumFOCUS peers what they're using and recommending, and I've asked peers within The Carpentries the same question.

Let's share notes here in this GitHub thread and make the decision here.

llorracc commented 6 years ago

Good idea to use a GitHub thread for this -- since GitHub itself is likely to be an integral part of whatever the answer will prove to be.

One other point is that we will need not only to choose a technology, but also:

Create some working examples of the use of that technology
Create materials for teaching students to use the technology
- This may mostly involve a curated set of links to existing materials

llorracc commented 6 years ago

So far (2018-08-04) the numfocus discussion has surfaced two candidates that we had not found before:

Sumatra

The mission statement is spot-on for what we are trying to achieve. But we are not sure how active the project is, or whether it aims to expand to fields other than its home in neuroscience.

Datalad’s aim is to record virtually every step in a whole neuroscience research project (code, data, writing, etc). For a team of people with a complete grasp of Datalad and a lot of experience with it, this sounds like it might be a great tool. It could be adapted to our narrower purposes, but how easily is unclear.

We've also looked at less specialized tools including:

but in our quick assessment none of them looked to have the combination we are seeking of being:

Lightweight (don't ask too much of the authors)
Future-proof (if the project dies, is the archive still usable)
Focused on our goal
- Not just archiving code, but ensuring that it runs and will continue to be runnable indefinitely

VickyRampin commented 6 years ago

I believe ReproZip is on-mission for you! It takes 2 steps for folks to bundle their work with ReproZip, and ReproZip not only captures all data and code, but the environment in which everything runs is captured. ReproZip gets everything needed to rerun someone's work and creates a single distributable bundle of it. You can see examples at: https://examples.reprozip.org

I actually wrote a paper about how since the ReproZip format is generalizable, it's great for archival purposes (I'm a librarian trained in digital preservation): https://iassistquarterly.com/index.php/iassist/article/view/18

ReproZip has unpackers (docker, vagrant), but doesn't rely on them to rerun the bundle. We are actually adding singularity as an available unpacker now!

Anyway, sorry to butt into your issue/convo, just wanted to add some additional info!

Edited to add: we also have a way to unpack & interact with ReproZip bundles in the cloud, a tool called ReproServer. In an author-reviewer scenario, the author uses ReproZip to make a bundle of all the dependencies and workflow to rerun their work correctly in the original environment in which they work. Then, the author can either put their RPZ file in a repository (OSF, figshare, etc) or send in the RPZ file with their paper. If the author sends the RPZ file, the reviewer can upload the RPZ into ReproServer and rerun it. If the author uploads the RPZ into a repository, they just need to give the link and the reviewer can pass the link to ReproServer and it'll grab it to reexecute the work within.

mnwhite commented 6 years ago

That sounds promising.

shaunagm commented 5 years ago

Closing this, it's closely related to Overark's issue #5 so we can discuss it there.

econ-ark / HARK

Choose a reproducibility platform/workflow #188

Sumatra

Datalad