CredibilityLab / groundhog

Reproducible R Scripts Via Date Controlled Installing & Loading of CRAN & Git Packages
https://groundhogr.com/
GNU General Public License v3.0

Using groundhog in foreach #115

Closed ejlundgren closed 1 month ago

ejlundgren commented 6 months ago

First, thank you for this wonderful package. I am preaching to everyone I know to use it!

I am writing to see if there is any documentation about using groundhog with foreach loops, or if this is an active realm of package development. I experimented with loading libraries inside the foreach loop with groundhog, which almost appears to work, but produces random errors. In this case, I am loading already saved models (.Rds files) and performing posthoc tests on them.

I do not know how to make this issue reproducible on your end...

A simplified version of the code that works:

library(doSNOW)  # also attaches foreach and snow (which provides makeCluster)

nCores <- parallel::detectCores() - 1
cl <- makeCluster(nCores)
registerDoSNOW(cl)

posthoc_comp_out <- foreach(i = 1:nrow(posthoc_comps),
                            .packages = c("metafor", "data.table", "dplyr",
                                          "tidyr", "broom", "multcomp"),
                            .errorhandling = "pass") %dopar% {
  m <- readRDS(posthoc_comps[i, ]$path)
  # Bunch of other things...
}

Loading libraries inside foreach with groundhog:

posthoc_comp_out <- foreach(i = 1:nrow(posthoc_comps),
                            .packages = "groundhog",
                            .errorhandling = "pass") %dopar% {
  # groundhog is called on every worker, once per iteration
  groundhog.day <- "2024-03-01"
  libs <- c("metafor", "data.table", "dplyr",
            "tidyr", "broom", "multcomp")
  groundhog.library(libs, groundhog.day)

  m <- readRDS(posthoc_comps[i, ]$path)
  # Bunch of other things...
}

This code produces these errors for random elements of the output list, but inconsistently:

Error in read.table(file = file, header = header, sep = sep, quote = quote, : no lines available in input
# And
Error in if (last.check.days < min.days) return(invisible("")): the condition has length > 1

My R version is 4.3.2, groundhog version 3.2.0, foreach version 1.5.2, doSNOW version 1.0.20, parallel version 4.3.2

urisohn commented 6 months ago

Since version 3.0, groundhog has explicitly taken parallel processing into account. It is the reason a major architectural aspect of groundhog was changed: groundhog now works by moving packages in and out of the default personal library, instead of calling those packages from dynamically set paths to additional libraries. This way, packages loaded with groundhog in one core are available for loading in all cores without additional calls to groundhog. In light of this, two thoughts.

First, I think a better practice might be to leave the groundhog command outside the foreach loop and, within the foreach, call functions with their package specified explicitly, something like

foreach(...) %dopar% data.table::fread(...)

It should be more efficient and easier to read, but these are often matters of taste.

Second, I don't think the errors you are getting are the result of using groundhog or foreach; they seem instead to be related to the .rds files. So, a question: if you load the .rds files with a non-parallel for() loop, or if you load the required packages with library() instead of groundhog.library(), does the code still produce the errors you are reporting? I think it will. But if it does not, and the errors arise only with groundhog, I would give this a closer look.
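A rough way to see why the library location matters for parallel work: cluster workers are fresh R processes, so they can only load packages from the library paths they start with. The sketch below is an illustration only. It uses the base parallel package rather than foreach or groundhog, and the cluster size of 2 is an arbitrary assumption. It compares the main session's .libPaths() with those reported by each worker.

```r
library(parallel)

# Start a small cluster of fresh R worker processes (size 2 is arbitrary).
cl <- makeCluster(2)

# Library paths as seen by the main session and by each worker.
main_paths   <- .libPaths()
worker_paths <- clusterEvalQ(cl, .libPaths())

stopCluster(cl)

# One logical per worker: does it see the same library paths as the main session?
same_paths <- vapply(worker_paths,
                     function(p) identical(p, main_paths),
                     logical(1))
print(same_paths)
```

If the main session changed .libPaths() at run time, the workers would not necessarily inherit that change, which is consistent with the motivation described above for placing packages in the default personal library.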

ejlundgren commented 6 months ago

Dear Uri,

Thanks so much for your fast response.

Just to clarify, namespacing the packages with package::function should do the trick? Does this mean I should not load packages in each core with foreach(..., .packages = c("xxx", "xxx"))?

I guess I don't know how the foreach .packages call works---whether it is shorthand for running library("data.table") in each core (i.e., not groundhog version controlled) or whether it transfers the packages in the Global Environment (i.e., the groundhog version controlled package) to each core. I hope that's clear.

And, sorry, I should have clarified. The .Rds files load just fine when I load them sequentially or when I load packages through the foreach .packages = c("xxx") argument. They only fail to load (and only about 5% fail) when I call groundhog directly in the body of the foreach loop.

urisohn commented 6 months ago

Got it. Actually, in any case you should not need to call groundhog within the loop; write the loop as if you were using library() instead of groundhog.library(). Put the groundhog.library() call before the foreach, then pass the packages through the .packages argument of foreach. Something like this:

# Before the loop
groundhog.day <- "2024-03-01"
libs <- c("metafor", "data.table", "dplyr",
          "tidyr", "broom", "multcomp")
groundhog.library(libs, groundhog.day)

# Loop
posthoc_comp_out <- foreach(i = 1:nrow(posthoc_comps),
                            .packages = libs,
                            .errorhandling = "pass") %dopar% {
  m <- readRDS(posthoc_comps[i, ]$path)
  # Bunch of other things...
}
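On the earlier question of what .packages does: as far as I understand the doSNOW and doParallel backends, each worker attaches the listed packages with something equivalent to library(pkg, character.only = TRUE), so whichever version is installed in the worker's library is the one that gets loaded. The sketch below mimics that step with base parallel only. The helper attach_pkgs is hypothetical, and "stats" and "tools" are stand-ins for the real package list so the example runs without extra installs.

```r
library(parallel)

# Stand-in package names; in the thread this would be `libs`.
pkgs <- c("stats", "tools")

# Hypothetical helper: attach each package on a worker, roughly what a
# foreach backend is assumed to do with the `.packages` argument.
attach_pkgs <- function(pkgs) {
  for (p in pkgs) library(p, character.only = TRUE)
  # Report which of the requested packages are now on the search path.
  paste0("package:", pkgs) %in% search()
}

cl <- makeCluster(2)
attached <- clusterCall(cl, attach_pkgs, pkgs)
stopCluster(cl)

print(attached)
```

Because nothing in this step names a version, version control has to come from what is installed in the library the workers read from, which is what the groundhog.library() call before the loop takes care of.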

If it does not work, try again with a date prior to two recent updates of data.table. Try 2024-02-15 and, if it still fails, 2024-01-15.

If that still fails, try this maximally similar code with non-parallel looping:

groundhog.day <- "2024-03-01"
libs <- c("metafor", "data.table", "dplyr",
          "tidyr", "broom", "multcomp")
groundhog.library(libs, groundhog.day)

# Loop
for (i in 1:nrow(posthoc_comps)) {
  m <- readRDS(posthoc_comps[i, ]$path)
  # Bunch of other things...
}

The error is produced by data.table, so I think you are somehow using different versions of it when you put the groundhog call inside the loop versus when you do it the other way.

Let's hope one of these ideas gets to it.

ejlundgren commented 6 months ago

Thanks a ton, that works! I just wanted to verify that that would load the groundhog versioned package into each core and not the default library version. Much appreciated!!
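One way to check this directly is to compare package versions between the main session and the workers. The sketch below is a minimal illustration under stated assumptions: it uses base parallel rather than foreach, the helper get_versions is hypothetical, and "stats" and "utils" stand in for the real package list (in the thread this would be libs, after the groundhog.library() call).

```r
library(parallel)

# Stand-in package names; replace with the packages loaded via groundhog.
pkgs <- c("stats", "utils")

# Hypothetical helper: named character vector of installed versions.
get_versions <- function(pkgs) {
  vapply(pkgs,
         function(p) as.character(utils::packageVersion(p)),
         character(1))
}

cl <- makeCluster(2)

main_versions   <- get_versions(pkgs)
worker_versions <- clusterCall(cl, get_versions, pkgs)

stopCluster(cl)

# TRUE only if every worker reports exactly the versions of the main session.
all_match <- all(vapply(worker_versions, identical, logical(1), main_versions))
print(all_match)
```

If all_match is FALSE for the real package list, the workers are reading a different library than the main session, and the per-worker vectors in worker_versions show which package differs.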