ck37 / varimpact

Variable importance through targeted causal inference, with Alan Hubbard

cannot run varimpact with multicore parallelisation #20

Open DS-Rodrigues opened 2 years ago

DS-Rodrigues commented 2 years ago

Hi Chris,

It seems that, for me at least, multicore parallelisation is not working with varimpact, but I might be doing something wrong. varimpact now seems to work with my current library (including learners with different parameters), but without parallelisation it has taken more than 24 hours to run just 2-fold CV and it is still running. I tried to create an example; please see below:

I am using macOS Big Sur on a MacBook Air (M1, 2020) with 16 GB RAM and 8 cores, running R version 4.1.0.

# dataset
my_outcome <- runif(300, min=0, max=500)
var_1 <- as.numeric(1:300)
var_2 <- runif(300)
var_3 <- runif(300, min=0, max=1)
var_4 <- as.factor(c(rep("0",30),rep("1",270)))
var_5 <- as.factor(c(rep("0",25),rep("1",275)))
my_dataset <- data.frame(var_1, var_2, var_3, var_4, var_5)
str(my_dataset)

# all libraries I am loading
library(tidyverse)
library(readxl)
library(openxlsx)
library(lubridate)
library(data.table)
library(optiRum)
library(gridExtra)
library(RColorBrewer)
library(anytime)
library(foreign)
library(ggplot2)
library(MASS)
library(Hmisc)
library(reshape2)
library(utils)
library(zoo)
library(lme4)
library(broom)
library(stats)
library(factoextra)
library(cluster)
library(pscl)
library(SuperLearner)
library(quadprog)
library(earth)
library(tmle)
library(xgboost)
library(randomForest)
library(ranger)
library(glmnet)
library(nnet)
library(kernlab)
library(KernelKnn)
library(varimpact)
library(hopach)
library(dbarts)
library(arm)
library(gam)

# Hyperparameter optimisation

# Fit elastic net with 5 different alphas: 0, 0.25, 0.5, 0.75, 1.0.
learners_glmnet = create.Learner("SL.glmnet", detailed_names = TRUE,
                                tune = list(alpha = seq(0, 1, length.out = 5)))

# 5 configurations
learners_glmnet$names

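# Random forest via ranger
# Note: SL.ranger's arguments are num.trees and min.node.size
# (ntree and nodesize are the randomForest names).
learners_ranger = create.Learner("SL.ranger", detailed_names = TRUE,
                                 tune = list(mtry = c(1), num.trees = c(1000), min.node.size = c(1, 5, 10)),
                                 name_prefix = "rgr")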

# 3 configurations
learners_ranger$names

# XGBoost
learners_xgboost = create.Learner("SL.xgboost", detailed_names = TRUE,
                                  tune = list(ntrees = c(500, 1000), max_depth = c(2, 4), shrinkage = c(0.001, 0.01)),
                                  name_prefix = "xgb")

# 8 configurations
learners_xgboost$names

# earth - Multivariate adaptive regression splines
learners_earth = create.Learner("SL.earth", detailed_names = TRUE,
                              tune = list(degree = c(1,2)))

# 2 configurations
learners_earth$names

# svm - support vector machine
learners_svm = create.Learner("SL.ksvm", detailed_names = TRUE,
                              tune = list(kernel = c("rbfdot", "polydot"), C = c(0.01, 0.1, 1, 10, 100)),
                              name_prefix = "svm")

# 10 configurations
learners_svm$names

# SL library:
Q_lib <- c("SL.mean", "SL.glm", "SL.bayesglm", learners_earth$names,
           learners_glmnet$names, "tmle.SL.dbarts2",
           "SL.rpartPrune", learners_ranger$names, learners_svm$names,
           learners_xgboost$names)

g_lib <- c("SL.mean", "SL.glm", "SL.bayesglm", learners_earth$names,
           learners_glmnet$names, "tmle.SL.dbarts2",
           "SL.rpartPrune", learners_ranger$names, learners_svm$names,
           learners_xgboost$names)

library(future)
plan("multisession")

vim <- varimpact(Y = my_outcome, data = my_dataset, Q.library = Q_lib, g.library = g_lib, family = "gaussian",
                 adjust_cutoff = NULL, V = 2)

As a result I get:

Finished pre-processing variables.

Processing results:
- Factor variables: 1 
- Numeric variables: 3 

Estimating variable importance for 1 factors.
Error estimating g using SuperLearner. Defaulting to glm
Error estimating g using SuperLearner. Defaulting to glm
Error estimating g using SuperLearner. Defaulting to glm
Error estimating g using SuperLearner. Defaulting to glm

Estimating variable importance for 3 numerics.
Error estimating g using SuperLearner. Defaulting to glm
Error in training_estimates[[bin_j]] : subscript out of bounds
In addition: Warning messages:
1: glm.fit: algorithm did not converge 
2: glm.fit: fitted probabilities numerically 0 or 1 occurred 
3: glm.fit: algorithm did not converge 
4: glm.fit: fitted probabilities numerically 0 or 1 occurred 
5: glm.fit: algorithm did not converge 
6: glm.fit: fitted probabilities numerically 0 or 1 occurred 
7: glm.fit: algorithm did not converge 
8: glm.fit: fitted probabilities numerically 0 or 1 occurred 
9: `funs()` was deprecated in dplyr 0.8.0.
Please use a list of either functions or lambdas: 

  # Simple named list: 
  list(mean = mean, median = median)

  # Auto named with `tibble::lst()`: 
  tibble::lst(mean, median)

  # Using lambdas
  list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
This warning is displayed once every 8 hours.
Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated. 

I closed my R session and started again and obtained the same result via "snow":

library(RhpcBLASctl)
library(future)

cl = parallel::makeCluster(get_num_cores())
plan("cluster", workers = cl)

vim <- varimpact(Y = my_outcome, data = my_dataset, Q.library = Q_lib, g.library = g_lib, family = "gaussian",
                 adjust_cutoff = NULL, V = 2)

Any ideas on what might be happening?

I think this is related to plan("multisession"). If I run plan("multicore"), I do not get those error messages, but I am not sure whether it is doing anything. Also, if I run plan("multiprocess"), I get the following message:

Warning messages:
1: Strategy 'multiprocess' is deprecated in future (>= 1.20.0). Instead, explicitly specify either 'multisession' or 'multicore'. In the current R session, 'multiprocess' equals 'multisession'. 
2: In supportsMulticoreAndRStudio(...) :
  [ONE-TIME WARNING] Forked processing ('multicore') is not supported when running R from RStudio because it is considered unstable. For more details, how to control forked processing or not, and how to silence this warning in future R sessions, see ?parallelly::supportsMulticore

This whole problem might be related to the future package. Is there a way to pass parallel = "multicore" as an argument to varimpact, as we can for CV.SuperLearner? That way of parallelising seems to work fine. With that in mind, I replaced SuperLearner::SuperLearner with SuperLearner::mcSuperLearner on line 118 of tmle_estimate_q.R and on line 78 of tmle_estimate_g.R, and then ran these two R scripts on my computer after loading varimpact. I did not get any error message, but I am not sure whether it is working. Any advice?
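One way to check whether the forked SuperLearner variant runs at all, independently of varimpact, is to call it directly on toy data. The following is only a sketch: the data, the two-learner library, and the choice of 2 cores are made up for illustration, and mcSuperLearner relies on forking, so it is not expected to work on MS Windows.

```r
library(SuperLearner)

set.seed(1)
n <- 200
X <- data.frame(x1 = runif(n), x2 = runif(n))  # toy covariates (illustrative only)
Y <- 2 * X$x1 - X$x2 + rnorm(n)                # toy continuous outcome
sl_lib <- c("SL.mean", "SL.glm")               # deliberately cheap learners

# mcSuperLearner parallelises via parallel::mclapply, which reads this option.
options(mc.cores = 2)  # 2 is an arbitrary choice for the sketch

t_seq <- system.time(
  fit_seq <- SuperLearner(Y = Y, X = X, family = gaussian(),
                          SL.library = sl_lib, cvControl = list(V = 5))
)
t_mc <- system.time(
  fit_mc <- mcSuperLearner(Y = Y, X = X, family = gaussian(),
                           SL.library = sl_lib, cvControl = list(V = 5))
)

# Elapsed seconds for each variant.
print(c(sequential = t_seq[["elapsed"]], forked = t_mc[["elapsed"]]))
```

With learners this cheap the forked run will likely not be faster; a timing difference should only become visible with a heavier library such as the one above.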

Once again, thanks very much for your input on this!

HenrikBengtsson commented 2 years ago

A few comments, ordered from easiest to hardest to grasp:

  1. multiprocess is deprecated, so the simple answer there is "just don't use it, forget about it, and you don't have to worry about it in your troubleshooting".

  2. multiprocess was just a convenient wrapper for if (parallelly::supportsMulticore()) plan(multicore) else plan(multisession). It turns out that this "convenience" added more confusion, especially when people on MS Windows talked to people on Unix and macOS. Now we are asking users to explicitly specify which of the two they want to use. The use of multiprocess also caused some developers on Unix and macOS to never test with multisession: whenever they ran the code themselves, they used multicore (forked processing), which often results in "weird, works for me" comments. So, I always recommend that package developers make sure things work with multisession.

  3. multicore uses so-called forked parallel processing, a parallelization mechanism provided by the operating system itself. R only supports forked parallel processing on Unix and macOS. So, if you use plan(multicore), it will work on those two platforms, whereas on MS Windows it'll fall back to plan(sequential).

  4. Now, not all code is safe to run in forked parallel processing. The combination of forked processing and non-fork-safe functions might actually crash or segfault your R session. Exactly when it is safe to "fork" is not easy to know, which complicates everything. One basically has to run through all tests and code to see if it works. Because of this, we (I and others) avoid telling people: "Just use plan(multicore) or parallel::mclapply(...), because it works great and is fast". You see lots of those comments online, but that may not be true for your particular pipeline. The safest is always to use plan(multisession) (PSOCK clusters): if that works, it works on all platforms in all R environments.

  5. Regarding multicore being disabled in RStudio: Please see ?parallelly::supportsMulticore for the motivation, but it's related to 4 (above). It's not just RStudio that may have this problem, but they have officially confirmed it causes problems for a lot of people.

So, I recommend trying to run with plan(multisession). That should always work. If it doesn't, then it's not the end user's fault, but something the (package) developer needs to fix (see also comment 2 above). Once you know the code parallelizes fine with plan(multisession) (e.g. that you get the correct results), then I would worry about parallelizing with plan(multicore). You obviously have to do that outside of the RStudio Console, e.g. in a regular terminal, or gamble and re-enable forked processing in the RStudio Console as explained in ?parallelly::supportsMulticore.
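The "check that you get the correct results" step can be sketched roughly as follows: run the same work under plan(sequential) and plan(multisession) and compare. The function slow_square() and the helper run_all() are hypothetical stand-ins for the real computation, and 2 workers is an arbitrary choice.

```r
library(future)

# Hypothetical stand-in for an expensive, deterministic task.
slow_square <- function(x) {
  Sys.sleep(0.1)
  x^2
}

# Launch one future per input and collect the values in order.
run_all <- function(xs) {
  fs <- lapply(xs, function(x) future(slow_square(x)))
  vapply(fs, value, numeric(1))
}

plan(sequential)
res_seq <- run_all(1:4)

plan(multisession, workers = 2)  # background R sessions (PSOCK)
res_par <- run_all(1:4)

plan(sequential)  # shut down the background workers again

# The two backends should agree for deterministic code.
stopifnot(identical(res_seq, res_par))
```

For code that uses random numbers, the comparison additionally requires parallel-safe seeding (for example seed = TRUE in future()), which this sketch leaves out.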

Regarding:

"It seems that for me at least multicore parallelisation is not working with varimpact, but it might be I am doing something wrong. ..."

I'm not a varimpact user, so I don't see what the problem really is, but I guess it's that it does not give the same results as when running with plan(sequential). In that case, I suggest that the developer make sure it works with plan(multisession); if it doesn't, then there's a bug.

DS-Rodrigues commented 2 years ago

Thanks very much Henrik, this is really helpful.

plan(multisession) does not work for me with varimpact, as described above, but I was running it in RStudio. It also does not work in R.

When you say "You obviously have to do that outside of the RStudio Console, e.g. in a regular terminal, or gamble and re-enable forked processing in the RStudio Console as explained in ?parallelly::supportsMulticore.", can I do it in the R console (instead of RStudio)? I have just done that in R. I ran plan(multicore), not being aware of the points you highlighted, and it worked!

Once again, thanks very much!

HenrikBengtsson commented 2 years ago

"plan(multisession) does not work for me with varimpact as described above, but I was running it in R Studio. It also does not work in R."

'multisession' works equally well in RStudio as when running R in the terminal. My comments above regarding RStudio and parallelization were about 'multicore', i.e. forked processing.

"When you say 'You obviously have to do that outside of the RStudio Console, e.g. in a regular terminal, or gamble and re-enable forked processing in the RStudio Console as explained in ?parallelly::supportsMulticore.' - can I do it in R (instead of R Studio) in the R console there?"

Yes, 'multicore' is enabled when running R in the terminal. It's only in the RStudio Console that it's disabled by default.
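To see which of the two backends the current R environment allows, and whether futures actually run in separate processes, something like the following sketch could be used. The choice of 2 workers and 4 futures is arbitrary.

```r
library(future)

# TRUE in a terminal R session on Unix/macOS; FALSE on MS Windows and,
# by default, in the RStudio Console.
can_fork <- parallelly::supportsMulticore()

if (can_fork) {
  plan(multicore, workers = 2)
} else {
  plan(multisession, workers = 2)
}

# Each future reports the process ID it was evaluated in.
fs <- lapply(1:4, function(i) future(Sys.getpid()))
worker_pids <- vapply(fs, value, integer(1))
main_pid <- Sys.getpid()

plan(sequential)  # release the workers

print(can_fork)
print(main_pid)
print(unique(worker_pids))
```

If the worker PIDs differ from the main PID, the futures were evaluated in other processes rather than in the main session.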

"But I run plan(multicore) - I was not aware of the points you highlighted - and it worked!"

Great. So, if I understand it correctly, it works for you in R when you use plan(multicore), but not plan(multisession). If so, I suspect there's something in varimpact that requires forked parallel processing in order for it to work. @ck37, do you have any comments?
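One common reason for code working under forking but not under multisession, offered here only as an unconfirmed illustration and not as a diagnosis of varimpact, is that objects defined in the main session (such as functions that are later looked up by name) are inherited by forked children but do not exist in fresh PSOCK workers unless they are exported. A base-R sketch, with my_learner as a hypothetical function:

```r
library(parallel)

# Hypothetical function that exists only in the main R session.
my_learner <- function(x) x + 1

cl <- makeCluster(2)  # PSOCK workers: fresh R sessions, as with multisession

# Fresh workers do not know about objects created in the main session.
before <- unlist(clusterEvalQ(cl, exists("my_learner")))

# After an explicit export, the workers can find the function by name.
clusterExport(cl, "my_learner", envir = environment())
after <- unlist(clusterEvalQ(cl, exists("my_learner")))

stopCluster(cl)

print(before)
print(after)
```

A forked child, by contrast, starts as a copy of the main process and therefore sees my_learner without any export step.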

ahubb40 commented 2 years ago

I wish I had something to offer here, but Chris (ck37) knows the inner workings. Thanks for all your attention to this.

Alan Hubbard Division of Biostatistics UC Berkeley (510)643-6160 http://hubbard.berkeley.edu
