ecpolley / SuperLearner

Current version of the SuperLearner R package

CV.SuperLearner fails with #100

Closed tedholzman closed 3 years ago

tedholzman commented 7 years ago

Hi.

I am trying to do a number of analyses with snowSuperLearner and CV.SuperLearner. CV.SuperLearner fails with this error:

Error in clusterApply(cl, x = splitList(X, length(cl)), fun = lapply, :
  formal argument "x" matched by multiple actual arguments
Calls: system.time ... CV.SuperLearner -> parLapply -> do.call -> clusterApply

The offending CV.SuperLearner call looks like this:

system.time(sl_cv_fit <- CV.SuperLearner(Y = Y, X = X, SL.library = SL.library, verbose = TRUE, method = "method.NNLS", cvControl = list(V = 10), parallel = cl, control = list(saveFitLibrary = TRUE)))

cl is a FORK type cluster with 10 nodes.

The statement within CV.SuperLearner that fails appears to be:

cvList <- parLapply(parallel, x = folds, fun = .crossValFun, Y = Y, dataX = X, family = family, SL.library = SL.library, method = method, id = id, obsWeights = obsWeights, verbose = verbose, control = control, cvControl = cvControl, saveAll = saveAll)

It is being run on a 64-bit computer with a very large memory capacity. The R version is 3.3.3.

Oh. This error occurs as soon as control hits that parLapply call. On a different computer with less memory (same R version), it fails with a "cannot allocate 4G vector" error after about 10 hours of computing.

Can you give me any advice?

Thanks. --Ted

ck37 commented 7 years ago

Hi Ted,

Sorry to hear that, hopefully we can figure this out. Can you give a few more items of info please:

If you use snowSuperLearner rather than CV.SuperLearner, does the analysis complete?

Thanks, Chris

ledell commented 7 years ago

@ck37 This looks vaguely familiar to me. When you try to use parLapply(cl = NULL, X, fun, ...) on a function that contains an argument named X, it gets confused because X matches two arguments. However, that's why the .crossValFun() wrapper function uses an argument called dataX instead of X, so I am confused as to why it's hitting this error.
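For illustration, here is a minimal sketch of how this kind of collision can arise. It does not use a cluster; inner_apply and outer_apply are hypothetical stand-ins for clusterApply and parLapply, not the real functions. The assumption is that the wrapper has an upper-case formal X and forwards `...` to an inner function whose formal is lower-case x, so a caller-supplied `x = ...` (as in the parLapply call quoted above) ends up in `...` and reaches the inner function twice. Whether this is what happens in this issue has not been confirmed in the thread.

```r
# Hypothetical stand-in for clusterApply(): its formal argument is lower-case `x`.
inner_apply <- function(cl, x, fun, ...) lapply(x, fun, ...)

# Hypothetical stand-in for parLapply(): its formal argument is upper-case `X`,
# and everything in `...` is forwarded to inner_apply().
outer_apply <- function(cl = NULL, X, fun, ...) {
  inner_apply(cl, x = X, fun = fun, ...)
}

# Lower-case `x` does not match the formal `X` (R is case-sensitive),
# so it lands in `...` and inner_apply() receives two arguments named `x`.
msg <- tryCatch(
  outer_apply(NULL, x = 1:3, fun = identity),
  error = function(e) conditionMessage(e)
)
print(msg)

# Matching the wrapper's formal name exactly avoids the collision.
ok <- outer_apply(NULL, X = 1:3, fun = identity)
print(ok)
```

If this reading applies, the argument name in the calling code (x versus X) would be the thing to check first.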

@tedholzman The second error you're getting, "cannot allocate 4G vector", means you've hit a memory limitation. You may be able to get around that by using fewer CV folds. However, if your data is just too big, you could try the subsemble package (which lets you keep the same base learner library), the h2oEnsemble package, or the h2o package. None of these alternatives has a built-in outer cross-validation function yet (a la CV.SuperLearner()), so you'd also have to write some extra code for that.
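As a rough idea of what that extra code could look like, here is a minimal sketch of a hand-written outer cross-validation loop in base R. Everything in it is an assumption for illustration: the data are simulated, lm() is only a placeholder for whichever ensemble fit you substitute, and the names (dat, fold_id, cv_mse) are made up. It is not how CV.SuperLearner() is implemented.

```r
# Simulated data, purely for illustration.
set.seed(1)
n <- 100
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 - dat$x2 + rnorm(n)

# Assign each row to one of V outer folds (V is a free choice here).
V <- 5
fold_id <- sample(rep(seq_len(V), length.out = n))

# For each fold: fit on the other folds, then score on the held-out fold.
cv_mse <- vapply(seq_len(V), function(v) {
  train <- dat[fold_id != v, ]
  valid <- dat[fold_id == v, ]
  fit <- lm(y ~ x1 + x2, data = train)  # placeholder learner; swap in your ensemble
  mean((valid$y - predict(fit, newdata = valid))^2)
}, numeric(1))

# Average held-out mean squared error across the outer folds.
print(mean(cv_mse))
```

Only one fold's model is held in memory at a time in this sequential form, at the cost of running the folds one after another.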