ecpolley / SuperLearner

Current version of the SuperLearner R package
272 stars 72 forks source link

Multicore needs a L'Ecuyer RNG for replicability #32

Closed ck37 closed 8 years ago

ck37 commented 8 years ago

Hi,

I was having issues with replicability for an SL analysis (R 3.3.0, SL 2.0-19) and learned that for mcSuperLearner it is important to use L'Ecuyer random numbers rather than the default. This only requires a minor adjustment to the set.seed call: set.seed(x, "L'Ecuyer"). So I think the main implication would be to tweak the multicore example provided in ?SuperLearner. It is important though because people may not realize that their current analyses are not replicable if they used mcSuperLearner and the default RNG.

Here is some test code I developed to figure out what was going on. It could be adapted into a test case at some point.

library(SuperLearner)
library(glmnet)

# Create test dataset.
set.seed(1)
N = 200
X = as.data.frame(matrix(rnorm(N*10), N, 10))
Y = rbinom(N, 1, plogis(.2*X[, 1] + .1*X[, 2] - .2*X[, 3] + .1*X[, 3]*X[, 4] - .2*abs(X[, 4])))

sl_lib = c("SL.glmnet", "SL.mean", "SL.svm")

# Function to run an SL configuration twice and check if coefficients are the same.
# Repeats this test multiple times and returns the success check results.
check_replication = function(f, repeats = 10, cluster = NULL, lecuyer=F) {
  results = rep(NA, repeats)
  for (i in 1:repeats) {
    # Run the SL function twice check if the coefficients are the same.
    # The SL function is expected to setup its seed in a deterministic manner.
    test1 = f()
    test2 = f()

    # Save whether or not the coefficients are the same for this repetition.
    results[i] = all.equal(coef(test1), coef(test2)) == T
  }
  results
}

# Reset RNG kind to the default; the Lecuyer test (#3) will change the RNG kind.
reset_rng_kind = function() {
  # Revert RNG kind to default.
  RNGkind("default")
  # Return the current RNG kind setting, for verification.
  RNGkind()
}

# Make sure we begin with the default RNG kind.
reset_rng_kind()

####################################
#1. Check standard SL.

replicateSL = function() {
  set.seed(1)
  SuperLearner(Y = Y, X = X, SL.library = sl_lib, family = binomial())
}
results_sl = check_replication(replicateSL)
# Replicable - mean expected be 1.
mean(results_sl)

####################################
#2. Check multicore SL with default RNG kind.

replicateMCSL = function() {
  set.seed(1)
  mcSuperLearner(Y = Y, X = X, SL.library = sl_lib, family = binomial())
}
results_mc = check_replication(replicateMCSL)
# Not replicable - mean expected to differ from 1.
mean(results_mc)

####################################
#3. Check multicore SL with lecuyer RNG kind specified.

replicateMCSL_lecuyer = function() {
  # This RNG tweak is recommended in ?parallel::mclapply
  set.seed(1, "L'Ecuyer")
  mcSuperLearner(Y = Y, X = X, SL.library = sl_lib, family = binomial())
}
results_mc = check_replication(replicateMCSL_lecuyer)
# Replicable - mean expected to be 1.
mean(results_mc)

####################################
#4. Check snow SL.

# Rest RNG kind because previous test changed the default.
reset_rng_kind()
cluster = makeCluster(2)
replicateSnowSL = function() {
  # We need to set the normal seed and the clusterRNGStream.
  set.seed(1)
  clusterSetRNGStream(cluster, iseed = 2343)
  snowSuperLearner(Y = Y, X = X, SL.library = sl_lib, family = binomial(), cluster=cluster)
}
results_snow = check_replication(replicateSnowSL)
# Replicable - mean expected to be 1.
mean(results_snow)
stopCluster(cluster)

#########################################
#5. Try multicore example from documentation.

## examples with multicore
set.seed(23432)
## training set
n <- 500
p <- 50
X <- matrix(rnorm(n*p), nrow = n, ncol = p)
colnames(X) <- paste("X", 1:p, sep="")
X <- data.frame(X)
Y <- X[, 1] + sqrt(abs(X[, 2] * X[, 3])) + X[, 2] - X[, 3] + rnorm(n)

# Skip test set generation to speed up test.

# generate Library and run Super Learner
SL.library <- c("SL.glm", "SL.randomForest", "SL.gam",
                "SL.polymars", "SL.mean")

testMC = function() {
  set.seed(1)
  # We remove the newX prediction as it slows down the test.
  mcSuperLearner(Y = Y, X = X, SL.library = SL.library)
}
# This takes a good 10 minutes or so.
results_mcex = check_replication(testMC)
# Not replicable - mean expected to differ from 1.
mean(results_mcex)

#########################################
#6. Revise multicore example from documentation.

testMC_lecuyer = function() {
  set.seed(1, "L'Ecuyer")
  mcSuperLearner(Y = Y, X = X, SL.library = SL.library)
}
# This takes a good 10 minutes or so.
results_mcex2 = check_replication(testMC_lecuyer)
# Replicable - mean expected to be 1.
mean(results_mcex2)
ecpolley commented 8 years ago

Thanks. Updated help docs for mcSuperLearner.