ecpolley / SuperLearner

Current version of the SuperLearner R package

Speed issues #125

Closed pspin1 closed 5 years ago

pspin1 commented 5 years ago

I noticed large performance discrepancies between SuperLearner and ranger when fitting a Random Forest model to a data frame with 1200 rows. The outcome is binary and the predictors include three continuous variables, one five-level factor variable, and one binary variable.

SuperLearner takes approximately 13 seconds to fit a default model, whereas ranger fits the default in 0.2 seconds.

This is not a big deal when running one model at a time, but it became a major constraint when running a permutation test (999 'simulations') to estimate an empirical p-value for Gini impurity importance.
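For readers unfamiliar with the setup, a permutation test of this kind can be sketched as follows. This is only an illustration under assumptions: the data are simulated, the helper `gini_importance()` is a hypothetical name, and the poster's actual procedure is not shown in the thread. It uses ranger directly with `importance = "impurity"` and refits the forest once per permutation, which is why per-fit run time matters so much.

```r
library(ranger)

set.seed(42)
n <- 300

# Simulated data (assumption): x1 is related to the outcome, x2 is noise
x1 <- rnorm(n)
x2 <- rnorm(n)
dat <- data.frame(y = factor(rbinom(n, 1, plogis(2 * x1))), x1 = x1, x2 = x2)

# Hypothetical helper: fit a forest and return Gini impurity importance
gini_importance <- function(d) {
  ranger(y ~ ., data = d, importance = "impurity",
         num.trees = 100)$variable.importance
}

obs <- gini_importance(dat)

n_perm <- 99  # the thread mentions 999; kept small here for speed
perm <- replicate(n_perm, {
  d <- dat
  d$y <- sample(d$y)  # permute the outcome to break any association
  gini_importance(d)
})

# perm is a predictors x permutations matrix; compare each column with obs
p_emp <- (1 + rowSums(perm >= obs)) / (1 + n_perm)
p_emp
```

Each permutation costs one full model fit, so a 13-second fit repeated 999 times takes hours, while a 0.2-second fit takes a few minutes.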

ck37 commented 5 years ago

There are two reasons for the difference: 1) SuperLearner conducts cross-validation (10-fold by default), so it fits the forest many times, whereas ranger fits it once; 2) the SuperLearner SL.ranger wrapper uses only 1 thread by default, which can be changed with create.Learner() (see e.g. https://github.com/ecpolley/SuperLearner/blob/master/vignettes/Guide-to-SuperLearner.Rmd#L292).
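The two points can be seen side by side in a small sketch. The data here are simulated to loosely resemble the description in the issue (an assumption, not the poster's data), and the choice of 4 threads and 5 folds is arbitrary. Timings will vary by machine, so none are quoted.

```r
library(SuperLearner)
library(ranger)

set.seed(1)
n <- 1200

# Simulated predictors (assumption): three continuous, one 5-level factor, one binary
X <- data.frame(
  x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n),
  f  = factor(sample(letters[1:5], n, replace = TRUE)),
  b  = rbinom(n, 1, 0.5)
)
Y <- rbinom(n, 1, plogis(X$x1 - 0.5 * X$b))

# Direct ranger call: a single forest, multithreaded by default
t_ranger <- system.time(
  fit_rf <- ranger(y ~ ., data = cbind(y = factor(Y), X), probability = TRUE)
)

# Default SuperLearner: cross-validated fits plus a final refit on all data,
# each using SL.ranger with num.threads = 1
t_sl <- system.time(
  fit_sl <- SuperLearner(Y = Y, X = X, family = binomial(),
                         SL.library = "SL.ranger")
)

# Custom wrapper with more threads; fewer folds via cvControl also reduces work
learners <- create.Learner("SL.ranger", params = list(num.threads = 4))
t_sl4 <- system.time(
  fit_sl4 <- SuperLearner(Y = Y, X = X, family = binomial(),
                          SL.library = learners$names,
                          cvControl = list(V = 5))
)

rbind(ranger = t_ranger, SL_default = t_sl, SL_tuned = t_sl4)[, "elapsed"]
```

create.Learner() defines the new wrapper function in the calling environment and returns its name in `learners$names`, which is what gets passed to `SL.library`.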

pspin1 commented 5 years ago

Just looping back. This helped a lot, thank you.

The following revision brought run time down to 1.1 seconds.

learners = create.Learner("SL.ranger", params = list(num.threads = 4))
system.time({
  out.sl.r4 <- mcSuperLearner(Y = Y, X = X, family = family, method = method,
                              SL.library = learners$names)
})