rrtools: Tools for Writing Reproducible Research in R

`rrtools` currently lacks package management, reducing reproducibility #84

Closed ntrlshrp closed 4 years ago

ntrlshrp commented 5 years ago

In brief:

Details:

A Dockerfile contains enough information to create an environment, but not enough information to reproduce an environment. Consider a Dockerfile that contains the command “install.packages(‘dplyr’)”. Following this instruction in August 2017 and again in December 2017 will result in two different Docker containers, since the current version changed. (from "RStudio and Docker")
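To make the quoted point concrete, here is a minimal sketch in R (the snapshot URL and dates are only illustrative): an unpinned install fetches whatever is current on CRAN at build time, whereas pointing the repository at a dated snapshot pins the result.

# Unpinned: resolves to whatever dplyr version is current on CRAN at build time,
# so a build in August 2017 and a build in December 2017 can differ.
install.packages("dplyr")

# Pinned (illustrative date): resolve against a dated snapshot so that
# rebuilding later still yields the same package versions.
install.packages("dplyr",
                 repos = "https://mran.microsoft.com/snapshot/2017-08-01")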

ntrlshrp commented 5 years ago

I plan to offer a pull request in a minute (which I hope is not considered poor behavior) but I imagine this will require some discussion. Looking forward to others' input on how to improve this (should @benmarwick wish to include this feature). Thanks.

benmarwick commented 5 years ago

Thanks for your suggestion, this is an important and challenging topic, indeed! Can you share some of your experience using packrat, e.g. link to some projects where you've used it and it's made a difference? I think it would be good to inspect some successful applications of packrat in compendium-like contexts to see how best to include it into rrtools.

Our current solution to package management in rrtools is versioned Docker containers. Our Dockerfile template ensures that the Docker container associated with the compendium uses the same R version as the user's local R at the time the Dockerfile is generated. A feature of these versioned Docker containers is that they install R packages from MRAN, Microsoft's versioned CRAN. This means that we get R packages from the date that that version of R was released. The RStudio document that you cite doesn't mention versioned rocker and MRAN, even though they were available during the time period mentioned in the document.

So even if you don't trust the Docker Hub, you can build the docker container at any time and repeatedly get the same R version and package versions. And those will be the same versions as when you started the compendium.
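For concreteness, a rough sketch of how one might check this from inside a container started from a versioned image (the image tag, snapshot URL, and package below are illustrative, and this assumes the rocker-versioned images set the default repository to an MRAN snapshot, as their documentation describes):

# started with e.g.: docker run --rm -it rocker/verse:3.5.3 R
getOption("repos")
# expected to show a dated MRAN snapshot rather than current CRAN,
# e.g. CRAN = "https://mran.microsoft.com/snapshot/2019-04-26"

# installs inside the container then resolve against that fixed snapshot,
# so rebuilding the image later gives the same package versions
install.packages("dplyr")
packageVersion("dplyr")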

My opinion is that this is preferable to packrat because it keeps the user's compendium lightweight (we don't store package sources or binaries locally, as packrat does), and because rocker is very solid and reliable. I've found packrat to be highly unreliable and buggy when using it with compendia. Many of the commits on my ktc11 compendium show my struggle to make packrat work. My sense is that versioned rocker containers are a good solution to package management and that packrat isn't necessary. But perhaps I've missed something?

Let's see what others reckon: @nevrome @annakrystalli @wolass what are your thoughts?

ntrlshrp commented 5 years ago

Thanks for the thoughtful response, @benmarwick, that helped me better understand the versioned rocker/verse containers. Given that, the RStudio quote covers too long a time period: August 2017 was R 3.4.0 and December 2017 was 3.4.2, so it spans more than one R release, whereas a versioned image is tied to a single release. This helpfully makes the problem smaller, tipping the balance against packrat.

New problem statement 1: Even if local machine packages and versions match rocker/verse:3.5.3 on 2019-04-01, they may not on 2019-04-26, leading to two versions of paper.html: (1a) the author's docker paper and (1b) the colleague's docker paper.

Example: An author using R 3.5.3 pushes for publication on 2019-04-01 using local versions V of packages P (R and all packages are the most recent versions). rocker/verse:3.5.3 publishes on 2019-04-01 with versions V of packages P (image packages plus author packages match the local list). A colleague attempts to reproduce the author's paper on 2019-05-01 with MRAN 2019-04-26 versions V' of packages P'. It is possible that V and V' are the same, but it is not guaranteed.

Details: According to rocker-org/rocker-versioned:

rocker/tidyverse:3.3.1 Docker image will always rebuild with R 3.3.1 and R packages installed from the 2016-10-31 MRAN snapshot, corresponding to the last day that version of R was the most recent release.

The MRAN date for R 3.5.3 is 2019-04-26. In the 25 days from April 2 to April 26, 2019, 843 of 14,081 packages on CRAN (6.0%) were updated or added. There have been 9 new rocker/verse:r_version images in the past 24 months, roughly one new version every 81 days (min: 39 days, max: 171 days). I suppose two (small) possibilities, due to package updates between the publication date and the MRAN date, are that: (1) a function (or argument) is deprecated, or (2) some package newly masks a function from another. Thus, I can see the weight of the argument that rocker/verse:r_version containers are reproducible enough, without the fragility and bloat of packrat.
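For anyone who wants to check counts like these, a rough sketch in R that compares two MRAN snapshots (the dates match those above; exact numbers depend on the snapshots used):

old <- available.packages(repos = "https://mran.microsoft.com/snapshot/2019-04-02")
new <- available.packages(repos = "https://mran.microsoft.com/snapshot/2019-04-26")

added   <- setdiff(rownames(new), rownames(old))          # packages new to CRAN
common  <- intersect(rownames(new), rownames(old))
updated <- common[old[common, "Version"] != new[common, "Version"]]

length(added) + length(updated)   # packages added or updated in the window
nrow(new)                         # total packages available at the later date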


Another potential argument is new problem statement 2: Local machine versions of packages may not match rocker/verse:3.5.3 on 2019-04-01, leading to three versions of paper.html: (2a) the author's local paper, (2b) the author's docker paper, and (2c) the colleague's docker paper.

Example: An author starts work on a paper using the global library of their local machine on 2018-04-01. They set the paper aside and come back in March 2019. The author decides to upgrade to R 3.5.3 but does not upgrade their packages, and eventually pushes for publication on 2019-04-01 with versions V of packages P' (image packages plus author packages do NOT match the local list). rocker/verse:3.5.3 publishes on 2019-04-01 with versions V' of packages P' (the most recent available packages). A colleague attempts to reproduce the author's paper on 2019-05-01 with MRAN 2019-04-26 versions V'' of packages P. It is possible that V, V', and V'' are all the same, but it is not guaranteed. [The difference between papers (2a) and (2b) can be overcome if the author uses docker locally to create the paper.html they review prior to pushing for publication and, upon seeing unexpected results or errors, updates code, upgrades packages, or both. They may, however, never notice the discrepancies absent giant surprises.]


Personal experience: Admittedly, I have not combined Docker and packrat before (hence my enthusiasm here, trying to reach perfect reproducibility). I've used packrat on my recent bookdown project, where the codebase had to survive over years and I did not wish to risk upgrading R or packages without locally testing the outputs for diffs, and where I wanted a way to easily restore previous package versions. I've also used it for a project where I felt the codebase was just too big and/or fragile, and I hoped to save myself the headache of broken code by freezing the package versions. I also sometimes use packrat for smaller projects that I expect to persist for extended lengths of time.

Thanks again for your time on this issue, I know this was a lengthy response!

benmarwick commented 5 years ago

Thanks for your detailed comment and suggestions. Yes, I agree there could be some potential for problems due to package changes in between MRAN dates that are tied to R release dates. Can you share the URLs to your repos that use packrat?

There are a few options for setting the MRAN snapshot date more precisely spelled out in this post: https://discuss.ropensci.org/t/creating-a-package-to-reproduce-an-academic-paper/1210/21 (including a method I've used in the past).

A nice PR to rrtools might be to set the MRAN date more exactly, rather than relying on the rocker date, if it is possible to override the versioned rocker.
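One minimal way to do that from the R side, rather than in the image itself, is to pin the CRAN repository option to a chosen snapshot date. A sketch (the URL and date are illustrative, and this is not what rrtools currently does):

# Could live in the compendium's .Rprofile, or be appended to Rprofile.site
# by a line in the Dockerfile; installs then resolve against this snapshot.
options(repos = c(CRAN = "https://mran.microsoft.com/snapshot/2019-04-01"))

install.packages("dplyr")   # now pulled from the pinned snapshot date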

Your problem #2 is a tough one. It seems like renv could indeed be a good solution (as a replacement for packrat). However, I see this note of caution:

Updating the Lockfile

While working on a project, you or your collaborators may need to update or install new packages in your project. The workflow remains the same as before – after installing these new packages, you can share the updated lockfile with your collaborators, and request that they execute renv::restore() to synchronize their library with the lockfile.

A bit of care needs to be taken if your collaborators attempt to update packages independently. It is recommended that a single ‘source of truth’ is used for the package sources and renv.lock, to avoid different collaborators ending up with different lockfiles – or even, different versions of the project sources!

The simplest way to guard against this is to use a version control system, and have all collaborators work off the same branch. This way, if someone needs to update renv.lock in the public repository, all collaborators will see that updated lockfile and will gain access to it next time they pull those changes.
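For reference, the workflow that passage describes boils down to a few calls (a sketch; details may differ between renv versions):

install.packages("renv")

renv::init()       # create a project-local library and an initial renv.lock
# ...install or update packages as the analysis evolves...
renv::snapshot()   # record the current package versions in renv.lock

# a collaborator, after pulling the project (including renv.lock):
renv::restore()    # reinstall exactly the versions recorded in the lockfile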

I've just tried it on a current project with @LiYingWang and it works fine on OSX, but when we try to work on the project on Windows we see this when we open the project:

Failed to find installation of renv -- attempting to bootstrap...
* Downloading renv 0.5.0-61 ... Error in utils::download.file(url, destfile = destfile, mode = "wb", quiet = TRUE) : 
  cannot open URL 'https://api.github.com/repos/rstudio/renv/tarball/0.5.0-61'
In addition: Warning message:
In utils::download.file(url, destfile = destfile, mode = "wb", quiet = TRUE) :
  cannot open URL 'https://api.github.com/repos/rstudio/renv/tarball/0.5.0-61': HTTP status was '404 Not Found'
Warning message:
Failed to find an renv installation: the project will not be loaded.
Use `renv::activate()` to re-initialize the project.

I wonder if the best use of developer and user time and effort to tackle your problem #2 is unit tests on key output (e.g. using testthat), rather than package version control?
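As a sketch of what I mean (the file paths, object names, and expected values here are entirely hypothetical):

library(testthat)

test_that("the headline estimate is stable", {
  results <- readRDS("analysis/data/derived_data/model_results.rds")
  expect_equal(results$estimate, 0.42, tolerance = 1e-6)
  expect_equal(nrow(results$data), 1500)
})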

ntrlshrp commented 5 years ago

Thanks again for the helpful contribution to the issue. Yes, the repos: https://gitlab.com/ntrlshrp/desserts (bookdown --> https://www.springer.com/us/book/9783030211257), https://gitlab.com/ntrlshrp/basic-income (I worried a package was fragile, so I froze packages in time), and https://gitlab.com/ntrlshrp/worksamples (only two things there, but I expect to add to it from time to time over an extended period). Each of these is private; I've invited all members whose names I found above as reporters (@benmarwick, @nevrome, @annakrystalli, @wolass; should future users desire access, please let me know). None of the projects is that pretty, but perhaps they aid the discussion.

Prior to this thread, I had actually started writing up a renv feature for rrtools, but I ran into problems with the MRAN downloads failing and decided to set aside the "experimental" renv package in favor of packrat. Perhaps in time renv will surpass expectations.

I had also thought the unit tests could be a useful alternative tack (as opposed to package version control), e.g., the Dockerfile copies in the repo, copies paper.Rmd to paper-docker.Rmd, renders it, and provides a diff of paper.html and paper-docker.html, as sketched below. Perhaps you had a different test in mind?
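A sketch of that comparison step in R (paths are illustrative; the second render would happen inside the container):

rmarkdown::render("analysis/paper/paper.Rmd",
                  output_file = "paper-docker.html")

local_html  <- readLines("analysis/paper/paper.html")
docker_html <- readLines("analysis/paper/paper-docker.html")

identical(local_html, docker_html)   # TRUE if the two renders agree line for line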

Here's my shorter problem statement: The author has R 3.5.2 (MRAN = 2019-03-11), and today their local machine has one package behind MRAN (outOfDate at 2.3, MRAN 2.5), one package ahead of MRAN (upToDate at 1.2, MRAN 1.1), and one package absent from MRAN (brandNewPackage at 1.1, MRAN --).
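A sketch of how an author in that position could spot the mismatches, by comparing their installed versions against the relevant MRAN snapshot (the snapshot date matches the R 3.5.2 example; package names like outOfDate are, as above, hypothetical):

snapshot <- available.packages(
  repos = "https://mran.microsoft.com/snapshot/2019-03-11")
local <- installed.packages()

common <- intersect(rownames(local), rownames(snapshot))
behind <- common[package_version(local[common, "Version"]) <
                 package_version(snapshot[common, "Version"])]   # e.g. outOfDate
ahead  <- common[package_version(local[common, "Version"]) >
                 package_version(snapshot[common, "Version"])]   # e.g. upToDate
not_on_mran <- setdiff(rownames(local), rownames(snapshot))      # e.g. brandNewPackage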

I guess one question (useful for making decisions?) is: (1) what fraction of rrtools users will publish papers with up-to-date versions of R and all packages? Those authors can all be helped with a fine-tuned MRAN date, as you noted. Then, for the rest (in the author's position above): (2) are they a large enough contingent to worry about, and (3) what will their preferences/behavior be when faced with a mismatch between paper.html and paper-docker.html?

benmarwick commented 5 years ago

I have a sense that the toolkit for package management is still in flux. It's not clear to me whether packrat, renv, or something else is going to be the way forward. I just saw this tweet that suggests pak might have some relevant functions: https://twitter.com/sinarueeger/status/1149793004393291776?s=19 Have you tried that?

About testing, I mean at a more granular level, such as testing a specific function or a section of the analysis that is central to the work. I've become quite interested in that approach lately, as it's more focused on the science of the compendium rather than the software engineering challenges of dependencies.

benmarwick commented 4 years ago

Let's close this for now, and keep an eye on how the technology is developing. It's not obvious what the best approach for rrtools is beyond what we already have, imperfect though it is.