ecpolley / SuperLearner

Current version of the SuperLearner R package

SuperLearner not achieving the best AUC average #115

Closed bdnffreud closed 6 years ago

bdnffreud commented 6 years ago

I encountered a surprising result with SuperLearner. With 5-fold cross-validation and method.AUC selected, the Super Learner ensemble does not have the best average AUC:

CV.SuperLearner(Y = Y, X = X, V = 5, family = binomial(), SL.library = SL.library, method = "method.AUC", control = list(saveFitLibrary = TRUE), cvControl = list(stratifyCV = TRUE), parallel = cluster)

Risk is based on: Area under ROC curve (AUC)

All risk estimates are based on V = 5

              Algorithm     Ave se     Min     Max
          Super Learner 0.78177 NA 0.73855 0.83446
            Discrete SL 0.78536 NA 0.73005 0.84282
          SL.glmnet_All 0.77739 NA 0.72042 0.83147
    SL.glmnet.ridge_All 0.78536 NA 0.73005 0.84282
    SL.glmnet.eln25_All 0.78165 NA 0.72535 0.83538
    SL.glmnet.eln50_All 0.77906 NA 0.71877 0.83156
    SL.glmnet.eln75_All 0.77694 NA 0.71935 0.83035
             SL.glm_All 0.75893 NA 0.69162 0.83198
             SL.gam_All 0.76102 NA 0.67717 0.82567
        SL.bayesglm_All 0.77321 NA 0.72450 0.83184
           SL.earth_All 0.61505 NA 0.51409 0.66740
SL.randomForest1000_All 0.73092 NA 0.64227 0.78205

If I exclude SL.glmnet.ridge from the list of algorithms, then the SuperLearner ensemble has the best AUC:

              Algorithm     Ave se     Min     Max
          Super Learner 0.77369 NA 0.70431 0.81643
            Discrete SL 0.75898 NA 0.69051 0.81561
          SL.glmnet_All 0.76792 NA 0.69373 0.80758
    SL.glmnet.eln25_All 0.77132 NA 0.69684 0.81561
    SL.glmnet.eln50_All 0.77081 NA 0.69051 0.82398
    SL.glmnet.eln75_All 0.76891 NA 0.69171 0.81191
             SL.glm_All 0.75996 NA 0.67298 0.79179
             SL.gam_All 0.76273 NA 0.68717 0.80089
        SL.bayesglm_All 0.76603 NA 0.69563 0.79984
           SL.earth_All 0.65069 NA 0.59185 0.69969
SL.randomForest1000_All 0.73585 NA 0.70240 0.77573

Is there any explanation for the SuperLearner ensemble not outperforming the top base algorithm (SL.glmnet.ridge)?

ecpolley commented 6 years ago

What is the sample size? In finite samples the ensemble isn't mathematically guaranteed to have the best cross-validated risk estimate.

bdnffreud commented 6 years ago

The sample size is 9827. The dataset is highly imbalanced (outcome = 112).
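
For a rough sense of how noisy fold-wise AUC can be at this event count, here is a minimal base-R sketch. It reads "outcome = 112" as 112 events among the 9827 observations, which is an assumption, and it uses simulated scores rather than the data from this issue. The shift of 1.1, the seed, and the helper name `auc` are all hypothetical choices made for illustration.

```r
# Hypothetical simulation: none of this is the data from this issue.
set.seed(1)
n <- 9827       # sample size quoted above
n_events <- 112 # assumed number of events
V <- 5          # number of cross-validation folds

# Outcome vector plus a simulated score shifted upward for events.
# With unit-variance normal scores, a shift of 1.1 corresponds to a
# true AUC of pnorm(1.1 / sqrt(2)), about 0.78.
y <- c(rep(1, n_events), rep(0, n - n_events))
score <- rnorm(n, mean = 1.1 * y)

# AUC via the rank (Mann-Whitney) formula.
auc <- function(score, y) {
  r <- rank(score)
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Stratified fold assignment: events and non-events are split separately,
# so each fold holds 22 or 23 events.
fold <- integer(n)
fold[y == 1] <- sample(rep_len(1:V, n_events))
fold[y == 0] <- sample(rep_len(1:V, n - n_events))

# AUC of the same fixed score, evaluated fold by fold.
fold_auc <- sapply(1:V, function(v) auc(score[fold == v], y[fold == v]))
round(c(Ave = mean(fold_auc), Min = min(fold_auc), Max = max(fold_auc)), 5)
```

The score here is identical in every fold, so any spread between Min and Max comes purely from which 22 or 23 events land in each fold.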

ecpolley commented 6 years ago

I assume this is an observed data set (not simulation) so you don't know the true AUC for these predictors?

I would say this isn't surprising. It is rare for the cross-validated risk estimate from the ensemble to be worse than that of an individual learner, but it can happen.
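
To put the gap in context, the arithmetic below uses only the numbers from the first results table in this thread. The variable names are illustrative. It compares the difference in average AUC between SL.glmnet.ridge_All and the Super Learner with the fold-to-fold range of each.

```r
# Values copied from the first results table in this thread.
sl    <- c(Ave = 0.78177, Min = 0.73855, Max = 0.83446)  # Super Learner
ridge <- c(Ave = 0.78536, Min = 0.73005, Max = 0.84282)  # SL.glmnet.ridge_All

# Difference in average cross-validated AUC.
gap <- unname(ridge["Ave"] - sl["Ave"])

# Spread of the per-fold AUC values (Max minus Min) for each row.
sl_range    <- unname(sl["Max"] - sl["Min"])
ridge_range <- unname(ridge["Max"] - ridge["Min"])

round(c(gap = gap, sl_range = sl_range, ridge_range = ridge_range), 5)
```

The gap in averages is 0.00359, while the per-fold AUC values span 0.09591 for the Super Learner and 0.11277 for ridge, so the difference between the two rows is well over an order of magnitude smaller than the variation across folds.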

bdnffreud commented 6 years ago

Yes, this is an observed dataset.

Thank you for your help!