RevolutionAnalytics / checkpoint

Install R packages from snapshots on checkpoint-server

options("repos") only changed to MRAN when scanForPackages = TRUE and there is a new package to install #274

Closed markusdumke closed 4 years ago

markusdumke commented 6 years ago

First of all thanks for the checkpoint package, I am using this a lot to ensure reproducibility of my analyses!

Recently I found some (for me) surprising behaviour of checkpoint, which seems to be a bug to me.

What I thought calling checkpoint::checkpoint would do:

But the second point only seems to be true if I run checkpoint with scanForPackages = TRUE and a new package is found that is not already installed. Otherwise options("repos") is not changed, so install.packages will install the latest package version from CRAN into the checkpoint folder. I think this is very confusing and probably harms reproducibility.

I see this code inside the checkpoint function:

if(length(packages.to.install) > 0) {
    # set repos
    setMranMirror(snapshotUrl = snapshoturl)

So repos is only changed when there are new packages to install. Wouldn't it be better to change this independently even if there are no new packages to install? Because users will still install new packages with install.packages and if these packages are installed from cran.rstudio.com the whole point of reproducibility with checkpoint is contradicted.
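
To make the proposal concrete, here is a minimal sketch of setting the repository unconditionally. It is not the actual checkpoint source: snapshot_repo and activate_snapshot are hypothetical names used only for illustration, and the URL pattern is simply the one shown in the options("repos") output in this thread.

```r
# Hypothetical, simplified stand-in for the control flow discussed above;
# NOT the real checkpoint implementation.
snapshot_repo <- function(date) {
  paste0("https://mran.microsoft.com/snapshot/", date)
}

activate_snapshot <- function(date, packages_to_install = character()) {
  # Point repos at the snapshot first, whether or not anything gets installed.
  old <- options(repos = c(CRAN = snapshot_repo(date)))
  if (length(packages_to_install) > 0) {
    # A real implementation would install here; this sketch only reports.
    message("Would install: ", paste(packages_to_install, collapse = ", "))
  }
  invisible(old)  # previous options, so the caller can restore them
}

old <- activate_snapshot("2018-06-01")
print(getOption("repos"))
options(old)  # restore the previous repos setting
```

With this ordering, a later install.packages call would resolve against the snapshot even when the scan finds nothing new.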

Here is example code to reproduce the problem:

.libPaths()
#> [1] "C:/ProgrammePAM/R-3.5.1/library"

options("repos")
#> $repos
#> [1] "https://cran.rstudio.com/"   "https://cloud.r-project.org"

checkpoint::checkpoint("2018-06-01",
                       checkpointLocation = "C:/R",
                       scanForPackages = FALSE)
#> Skipping package scanning
#> checkpoint process complete
#> ---

.libPaths()
#> [1] "C:/R/.checkpoint/2018-06-01/lib/x86_64-w64-mingw32/3.5.1"
#> [2] "C:/R/.checkpoint/R-3.5.1"                                
#> [3] "C:/PROGRA~4/R-35~1.1/library"

# repos is not changed to MRAN!
options("repos")
#> $repos
#> [1] "https://cran.rstudio.com/"   "https://cloud.r-project.org"

checkpoint::checkpoint("2018-06-01",
                       checkpointLocation = "C:/R",
                       scanForPackages = TRUE)
#> Scanning for packages used in this project
#> No file at path 'C:\Users\QXV6024\AppData\Local\Temp\Rtmpek7pGt\file344416693e26.Rmd'.
#> - Discovered 3 packages
#> Installing packages used in this project
#>  - Installing 'A3'
#> A3
#> also installing the dependency 'pbapply'
#> checkpoint process complete
#> ---

library(A3)
#> Loading required package: xtable
#> Loading required package: pbapply

.libPaths()
#> [1] "C:/R/.checkpoint/2018-06-01/lib/x86_64-w64-mingw32/3.5.1"
#> [2] "C:/R/.checkpoint/R-3.5.1"                                
#> [3] "C:/PROGRA~4/R-35~1.1/library"

# Now repos is changed to mran!
options("repos")
#> $repos
#> [1] "https://mran.microsoft.com/snapshot/2018-06-01"

martincadek commented 5 years ago

I agree with the comment above; I've been scratching my head over similar behaviour. After installing my packages, I realised that when I started the project again, the checkpoint date didn't get updated automatically.

I thought something was wrong, but it's probably the expected behaviour, as suggested in the comment above.

library("checkpoint")
# Create a checkpoint by specifying a snapshot date
checkpoint("2019-03-10", scanForPackages = TRUE) # R version 3.5.1 (2018-07-02)

Outputs:

Scanning for packages used in this project
|==============================================================================| 100%
- Discovered 8 packages
All detected packages already installed
checkpoint process complete
---
# Check that CRAN mirror is set to MRAN snapshot
getOption("repos")

Outputs: (note: I am using Open R)

 CRAN 
"https://mran.microsoft.com/snapshot/2018-08-01" 
                                       CRANextra 
            "http://www.stats.ox.ac.uk/pub/RWin" 

However, I would have expected "https://mran.microsoft.com/snapshot/2019-03-10", as this is the checkpoint date I specified. Is there a rationale behind this behaviour? It would be helpful to describe it in the help file.

markusdumke commented 5 years ago

Yes, I agree this is confusing, and it would help a lot if it were clarified in the checkpoint documentation.

The second point you have to think about is the library paths where R looks for packages. checkpoint puts the path to the checkpoint library in first position, but your normal user library is still there in second position. This means that if a package is missing from your checkpoint library (e.g. because installation failed) but is installed in your normal user library (in any version), R will just use that copy. This is also very dangerous in terms of reproducibility. So I am now using a solution similar to this:

checkpoint::checkpoint("2019-03-13", scanForPackages = TRUE)

# To change the CRAN mirror to MRAN mirror of specified date
checkpoint::setSnapshot("2019-03-13")

# Make sure that packages are loaded from checkpoint directory
library(data.table, lib.loc = .libPaths()[1])
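
As a complement to the lib.loc approach above, the following sketch checks which library a package actually resolves from, using only base R. The helper names package_library and from_first_library are hypothetical, and "stats" is used only because it ships with every R installation.

```r
# Return the library directory a package resolves from.
package_library <- function(pkg) {
  normalizePath(dirname(find.package(pkg)), winslash = "/")
}

# TRUE if the package resolves from the first entry of .libPaths(),
# which is where checkpoint places its library according to this thread.
from_first_library <- function(pkg) {
  identical(package_library(pkg),
            normalizePath(.libPaths()[1], winslash = "/"))
}

print(package_library("stats"))
print(from_first_library("stats"))
```

A FALSE result for a project package would suggest it is being picked up from a library other than the first one.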

martincadek commented 5 years ago

So I am using now a solution similar to this:

checkpoint::checkpoint("2019-03-13", scanForPackages = TRUE)

# To change the CRAN mirror to MRAN mirror of specified date
checkpoint::setSnapshot("2019-03-13")

# Make sure that packages are loaded from checkpoint directory
library(data.table, lib.loc = .libPaths()[1])

This seems like a good solution to ensure your collaborators use the appropriate libraries. I'd probably even set .libPaths in the project's .Rprofile, as suggested here for example. Right now I've decided to just lazily use what you suggest above, edited so that the first path in .libPaths() becomes the only library path in the current session: assign(".lib.loc", .libPaths()[1], envir = environment(.libPaths)). It's probably wise to check that .libPaths() is still set correctly afterwards, but this saves time. Maybe this could be implemented in checkpoint as setLibrary (to complement setSnapshot). This would assume all packages are in the checkpoint library, though.
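
A minimal sketch of the single-library idea above, wrapped in a helper that also allows restoring the previous paths. The name restrict_to_first_library is hypothetical and not part of checkpoint. The approach relies on .lib.loc, an undocumented internal of base R's .libPaths, so it is an assumption that may not hold across R versions.

```r
# Hypothetical helper: keep only the first library path for this session.
# Relies on the internal variable .lib.loc inside .libPaths's environment.
restrict_to_first_library <- function() {
  env <- environment(.libPaths)
  old <- get(".lib.loc", envir = env)
  assign(".lib.loc", old[1], envir = env)
  invisible(old)  # previous paths, so they can be restored later
}

saved <- restrict_to_first_library()
print(.libPaths())  # now a single path

# Restore the original library paths.
assign(".lib.loc", saved, envir = environment(.libPaths))
```

While the restriction is active, packages that exist only in other libraries can no longer be found, which is the caveat mentioned above.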

vspinu commented 5 years ago

Any news on this?

One currently needs a whole set of workarounds to make it work as advertised.

This is what I have currently:

  snapshot <- "2019-11-01"
  # set it by default; otherwise pinging takes ages
  options(checkpoint.mranUrl = "https://mran.microsoft.com/")
  # Scanning takes ages (due to slow url checks), but we need to scan if the
  # repo doesn't exist
  # https://github.com/RevolutionAnalytics/checkpoint/issues/281
  do_scan <- !snapshot %in% checkpoint::checkpointArchives() 
  checkpoint::checkpoint(snapshot, scanForPackages = do_scan, verbose = interactive())
  ## https://github.com/RevolutionAnalytics/checkpoint/issues/274
  checkpoint::setSnapshot(snapshot, FALSE)

hongooi73 commented 4 years ago

This should be resolved in the new v1.0 checkpoint, just pushed to master. If you want to use an existing checkpoint without installing any packages:

use_checkpoint("snapshot_date")