RMI-PACTA / docker

Docker images
https://2degreesinvesting.github.io/docker/
MIT License

Extract install.R to reuse it from elsewhere #55

Closed: maurolepore closed this 3 years ago

maurolepore commented 3 years ago

I'm developing a lightweight computing environment for PACTA_analysis and I would like to reuse the R packages we install here, independently from everything else (like setting options).

This PR extracts the call to install.packages() into the file docker/r-packages/install.R.

To keep the repo working at all times, this PR only adds the file docker/r-packages/install.R and does not touch the Dockerfile.

If this PR is merged, then I plan to follow up with another PR that changes Dockerfile, replacing this:

    && Rscript -e "install.packages( \
             c( \
               'assertthat', \
               'bookdown', \
               'config', \
               ...
               'zoo' \
             ) \
           )" \

with this:

    && Rscript -e "source('https://raw.githubusercontent.com/2DegreesInvesting/docker/master/r-packages/install.R')" \

or with this:

    COPY install.R install.R
    RUN ...
      && Rscript -e "source('install.R')" \
      ...
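
As a rough illustration only (this is not the actual file in this PR), an install.R along these lines would work with either variant above. The package vector contains just the names visible in the snippet, with the elided ones left out. The helpers `missing_pkgs()` and `install_missing()` are hypothetical, and skipping already-installed packages is an assumption of this sketch, not something the PR does.

```r
# Hypothetical sketch of r-packages/install.R; names and helpers are illustrative.
# Only the packages visible in the Dockerfile snippet are listed here.
pkgs <- c("assertthat", "bookdown", "config", "zoo")

# Return the packages from `pkgs` that are not in `installed`.
# By default, `installed` is whatever the current R library already contains.
missing_pkgs <- function(pkgs, installed = rownames(installed.packages())) {
  setdiff(pkgs, installed)
}

# Install whatever is missing. `installer` is a parameter so the logic can be
# exercised without network access; it defaults to install.packages().
install_missing <- function(pkgs, installer = install.packages) {
  to_install <- missing_pkgs(pkgs)
  if (length(to_install) > 0) {
    installer(to_install)
  }
  invisible(to_install)
}
```

Ending the file with a call to `install_missing(pkgs)` would make `source('install.R')` perform the installation, both from the Dockerfile and from any other environment.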

maurolepore commented 3 years ago

Thanks CJ. I'll close this PR to let @jdhoffa be part of the process, as he and @AlexAxthelm recently expressed interest in improving our Docker infrastructure.

In any case, here are my answers to your questions/comments:

> I don't really understand the purpose of this.

The purpose is to describe the R packages we need independently from everything else. I note my approach here is already too prescriptive, as it describes not only which packages to install but also how to install them (with install.packages()). A more pure approach may be to list the packages alone, e.g. in a DESCRIPTION file or a .json file as renv does.

> Isn't the fundamental idea of a Docker script that you can document a complete "recipe" for the computing environment you want in one script/place? So, extracting part of that and putting it elsewhere seems antithetical, no?

I've seen Dockerfiles in the wild that call external scripts. I like that approach if it makes the image more reusable, or the Dockerfile more readable.

cjyetman commented 3 years ago

thanks for the explanation.... I guess part of it is that I'm not sure what determines which packages are needed here... if it's the PACTA_analysis repo, or even more specifically the transitionmonitor.com Docker image, then maybe it makes sense to have the dependent packages documented in some machine-readable way over there... but I was a bit confused about having a Dockerfile and an R script like this side-by-side

maybe I interpreted "reuse it from elsewhere" wrong, and that's meant more to mean you want that part of the dockerfile (the install packages bit) to be accessible and easily useable from somewhere else... like some other repo could source this R script remotely? 🤷🏻

AlexAxthelm commented 3 years ago

I see a few different ways this could go (roughly in order of my preference):

  1. Define dependencies as part of DESCRIPTION, and let R's package management handle everything
  2. Define dependencies in the Dockerfile, and switch to devtools::install_version() to pin package versions
  3. Use a script like this to define dependencies, preferably with version pinning (in the dockerfile)
  4. Define dependencies in dockerfile using install.packages
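
For options 2 and 3, version pinning could be sketched roughly as follows. This is an assumption-laden illustration: the version numbers are placeholders rather than the versions any PACTA repo actually needs, and `install_pinned()` is a hypothetical helper. The only real API relied on is `devtools::install_version(package, version)`.

```r
# Placeholder pins for illustration only; real versions would come from the project.
pinned <- c(assertthat = "0.2.1", config = "0.3.1")

# Install each package at its pinned version.
# `installer` is a parameter so the loop can be checked without network access;
# its default is only evaluated when no installer is supplied.
install_pinned <- function(pinned, installer = devtools::install_version) {
  for (pkg in names(pinned)) {
    installer(pkg, version = pinned[[pkg]])
  }
  invisible(names(pinned))
}
```

The same function could be called from a `RUN Rscript -e "..."` step in the Dockerfile (option 2) or live in a sourced script (option 3); the difference is only where the pins are kept.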

cjyetman commented 3 years ago

> I see a few different ways this could go (roughly in order of my preference):
>
>   1. Define dependencies as part of DESCRIPTION, and let R's package management handle everything
>   2. Define dependencies in the Dockerfile, and switch to devtools::install_version() to pin package versions
>   3. Use a script like this to define dependencies, preferably with version pinning (in the dockerfile)
>   4. Define dependencies in dockerfile using install.packages

Have you considered renv or packrat as an option?

jdhoffa commented 3 years ago

> Have you considered renv or packrat as an option?

These options solve a problem that docker also solves, so I think they would be redundant, no? Unless the suggestion is to abandon docker in favour of renv or packrat.

jdhoffa commented 3 years ago

Related, I see many dependencies documented here. I assume these are dependencies for all of PACTA_analysis, create_interactive_report, r2dii.climate.stress.test, is that more or less correct?

Are the imports in PACTA_analysis/DESCRIPTION up to date? I think we should try to only solve this problem once haha

maurolepore commented 3 years ago

> Have you considered renv or packrat as an option?

I think renv (and packrat before it) helps manage only part of the dependencies: R packages. Docker helps manage any system dependency. So in my opinion renv alone is insufficient, Docker alone is sufficient, and Docker + renv is a bit complicated, based on what I read here: https://rstudio.github.io/renv/articles/docker.html

Here is renv itself explaining its scope:

> While renv can help capture the state of your R library at some point in time, there are still other aspects of the system that can influence the runtime behavior of your R application, [for example] the operating system in use. Docker is a tool that helps solve this problem through the use of containers. --https://rstudio.github.io/renv/articles/docker.html

AlexAxthelm commented 3 years ago

Looking through the dependencies for the packages in question here, I don't see any for non-R utilities (such as odbc, which is required by DBI and friends, for example), but I don't know about our other repos. I think I prefer Docker over renv in case we do introduce such a dependency in one of our projects in the future.

My overall goal is not just to stabilize, but to standardize our workflows, so that we don't have to worry about "how does this project manage dependencies", and I think Docker gives us a "least common denominator" in that we can define every part of the environment. (@cjyetman: issues with host machines, like the ones we ran into with Constructiva and case-sensitive file systems, can be accounted for now that we know they're an issue.)

cjyetman commented 3 years ago

> Related, I see many dependencies documented here. I assume these are dependencies for all of PACTA_analysis, create_interactive_report, r2dii.climate.stress.test, is that more or less correct?

I think this is important, and getting back to a core question that I asked above... what determines the dependencies here? Is it PACTA, pure-PACTA, offline PACTA, online PACTA, PACTA and friends, the transitionmonitor.com Docker image, a desire to also include additional software like RStudio Server for the benefit of users like Mauro, a desire to use a dev version of an R package that will require compilation because someone really likes a fancy new feature it has, some combination of those?

If it's "pure PACTA", then there are certainly no special dependencies beyond a handful of R packages. Any fancy dependencies that I'm aware of come from special use cases, like making a PDF, hence some of the LaTeX stuff here, because the original purpose of this was to prepare an environment specifically to be used on transitionmonitor, and eventually building a PDF became an unfortunate necessity there.

jdhoffa commented 3 years ago

Hmm, I don't have an answer to that personally, cause I'm not actually trying to "do" any particular use-case haha, I just wanted to know what use-case defined this list of packages

But I also feel like I'm hijacking this thread a bit, so I'm gonna tap out.

As you were!

cjyetman commented 3 years ago

> Hmm, I don't have an answer to that personally, cause I'm not actually trying to "do" any particular use-case haha, I just wanted to know what use-case defined this list of packages

the answer to your question: the Docker image that needs to run on transitionmonitor.com... at least that was its original intent

cjyetman commented 3 years ago

> the answer to your question: the Docker image that needs to run on transitionmonitor.com... at least that was its original intent

realising now that some more detail on this might be useful....

The dependencies (R pkgs and otherwise) installed by this Dockerfile are determined by a need to:

  1. run the PACTA process, i.e. web_tool_script_1.R and web_tool_script_2.R through to line 211 where the stress testing stuff starts
  2. run the stress testing code, as triggered by the last few lines of web_tool_script_2.R
  3. run the code in the create_interactive_report repo, as triggered by the code in web_tool_script_3.R, which includes generating the interactive report and generating an "executive summary" PDF (this is the only part that requires the LaTeX dependencies)

To be honest, there may be a few other developer-related dependencies in there that are not strictly needed for the above steps (e.g. testthat), as well as a few dependencies that may not actually be needed anymore (e.g. highcharter).

jdhoffa commented 3 years ago

Ok awesome, that's helpful to me. Thanks CJ

AlexAxthelm commented 3 years ago

> what determines the dependencies here? Is it PACTA, pure-PACTA, offline PACTA, online PACTA, PACTA and friends...

This is one of the primary reasons that I'd (eventually) like to go to each repository having its own. That way, each repo can define its own deps without worrying about breaking others.

Then for things like TM-docker, we can either wrap everything together into one, with a common set of deps, or (preferably, to me) give Constructiva access to a private container registry, and tell them which container tags to use.