PecanProject / pecan

The Predictive Ecosystem Analyzer (PEcAn) is an integrated ecological bioinformatics toolbox.
www.pecanproject.org

Improve Docker reproducibility #2769

Open ashiklom opened 3 years ago

ashiklom commented 3 years ago

For one, I think we should implement the changes in the closed PR https://github.com/PecanProject/pecan/pull/2768. But that requires rebuilding the depends image and some additional testing that I don't have time to do right now.

Basically, I think we have three options that take full advantage of the current Docker reproducibility mechanisms:

  1. Stick with R 4.0.2 and accept that our packages will be a few months out of date.
  2. Bump to R 4.0.3 and accept using the latest CRAN versions of all packages (until R is bumped to 4.0.4, when this image will be frozen).
  3. Instead of starting from the rocker/tidyverse image, build R in our pecan/depends image from scratch (based on the existing template), and manually specify (by modifying the CRAN URL) exactly which R package snapshots we want to work with.

We also have a few hacky options (like manually updating specific packages to specific versions and/or adding a bit of text editing code into our Dockerfile to update the image's repos definition in .Rprofile). But my personal favorite is (3) above: doing it is much easier than it sounds (the parent Dockerfile is pretty small, I think because the parent Ubuntu 20 image has all the dependencies already), and it gives us maximum control over package versions. The ideal system would be to build 2-3 versions of the depends image based on different snapshots ("stable", "next", "latest"?), test against all of them, and periodically bump what is considered "stable" and "next" whenever we're comfortable.

robkooper commented 3 years ago

I would like to use the fixed versions of R, like 4.0.2 with a fixed cran repo, that way we have a consistent build.

We can even tell people to use it when working with R on their local machine.

robkooper commented 3 years ago

4.0.2 => ENV CRAN=https://packagemanager.rstudio.com/all/__linux__/focal/344
4.0.3 => ENV CRAN=https://packagemanager.rstudio.com/all/__linux__/focal/latest

Should we build 4.0.3 and see if it is ready for the future?

ashiklom commented 3 years ago

+1 to building against 4.0.2 and 4.0.3 -- that's a great addition.

4.0.2 => ENV CRAN=https://packagemanager.rstudio.com/all/__linux__/focal/344
4.0.3 => ENV CRAN=https://packagemanager.rstudio.com/all/__linux__/focal/latest

My point about (3) above was basically that there's a lot of middle ground between these two options that the default R images don't accommodate. The only snapshots we can pick from by default are the ones taken immediately before the next R patch release, which forcibly ties our package updates to R's release schedule. Based on this table, it looks like R patch releases come out about 3-4 times a year on average. In practice, that means that 4.0.2 precludes us from using any package updates more recent than October 2020. That's probably fine, as long as folks try to avoid using bleeding edge package features in PEcAn (I'm quite guilty of this...).

My suggestion with (3) was that we could choose to be more nimble if we so desired by directly controlling the ENV variable. The RStudio package manager makes new snapshots multiple times a week, which gives us a lot of granularity.
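As a rough sketch of what that could look like, the script below prints (rather than runs) one build command per R version / snapshot pair, using the two pairs quoted above. The build-arg names `R_VERSION` and `CRAN`, the tag scheme, and the `docker/depends` build context are all assumptions about a custom depends Dockerfile, not something that exists in the current setup.

```shell
#!/bin/sh
# Sketch only: prints the docker build commands instead of running them.
# R_VERSION and CRAN are assumed build-arg names for a custom depends
# Dockerfile; the tag scheme and build context path are hypothetical.

BASE_URL="https://packagemanager.rstudio.com/all/__linux__/focal"

# Build the repository URL for a snapshot ID (a number, or "latest").
cran_snapshot_url() {
    printf '%s/%s\n' "$BASE_URL" "$1"
}

# Print the build command for one R version / snapshot pair.
build_command() {
    r_version=$1
    snapshot=$2
    printf 'docker build --build-arg R_VERSION=%s --build-arg CRAN=%s -t pecan/depends:R%s docker/depends\n' \
        "$r_version" "$(cran_snapshot_url "$snapshot")" "$r_version"
}

# The two pairs quoted above; further snapshot IDs could be added here.
build_command 4.0.2 344
build_command 4.0.3 latest
```

Picking a more recent snapshot for an existing R version would then only mean changing the snapshot ID on one line.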

More important than the frequency of updates is that breaking changes to R packages happen independently of R patch releases. E.g., if we want to guard against breakage from a hypothetical dplyr 2.0 or testthat 4.0, directly controlling which CRAN snapshot we're using is more effective than doing it based on R versions.

But because the Rocker Dockerfiles use the ENV variable to set the repos in the .Rprofile at build time, we would have to rebuild the R images ourselves; we can't just change the ENV variable on pre-built images.
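To illustrate why, here is a small sketch that imitates the build step with a temporary file standing in for the image's Rprofile. The exact `options()` line is an assumption about what the Rocker scripts write, and `write_repos` / `rewrite_repos` are hypothetical helper names. It shows that the URL is copied into the file once, that changing the variable afterwards has no effect on it, and that the "text editing" workaround mentioned earlier amounts to rewriting that line.

```shell
#!/bin/sh
# Sketch only: uses a temporary file in place of the image's real Rprofile.
# The exact options() line written by the Rocker scripts is an assumption.

BASE_URL="https://packagemanager.rstudio.com/all/__linux__/focal"

# Write the repos setting into an Rprofile-style file, as a build step would.
write_repos() {
    printf "options(repos = c(CRAN = '%s'))\n" "$2" > "$1"
}

# Rewrite the CRAN URL in an existing file (the "text editing" workaround).
rewrite_repos() {
    tmp="$1.tmp"
    sed "s|CRAN = '[^']*'|CRAN = '$2'|" "$1" > "$tmp" && mv "$tmp" "$1"
}

profile=$(mktemp)

# "Build time": the value of CRAN is copied into the file.
CRAN="$BASE_URL/344"
write_repos "$profile" "$CRAN"

# "Run time": changing the variable does not touch the file.
CRAN="$BASE_URL/latest"
cat "$profile"

# Only editing the file (or rebuilding) changes what R would see.
rewrite_repos "$profile" "$CRAN"
cat "$profile"

rm -f "$profile"
```

The first `cat` still shows the 344 snapshot even though `CRAN` has been reassigned; only the second, after the rewrite, shows `latest`.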

ashiklom commented 3 years ago

One other option (mentioned in #2779) is to use something like renv to control versions of every package individually. That gives maximum control, but would require revisions to our build system, and making sure that we always use the renv library and not the system default. That shouldn't be too difficult, but would definitely be non-trivial.

github-actions[bot] commented 2 years ago

This issue is stale because it has been open 365 days with no activity.