ecpolley / SuperLearner

Current version of the SuperLearner R package

SuperLearner() and friends now tolerate single-column X #105

Closed: saraemoore closed this 6 years ago

saraemoore commented 6 years ago

If provided with a single-column X, SuperLearner(), snowSuperLearner(), mcSuperLearner(), and SampleSplitSuperLearner() would fail ("all algorithms dropped from library") because whichScreen was not simplified as intended: in this case sapply() returns a vector rather than a matrix, so t(sapply(...)) yields a single-row matrix instead of one row per screener. This pull request fixes the bug. Some tabbing and extra spaces have also been cleaned up in SuperLearner() and SampleSplitSuperLearner() (sorry for the big diffs).
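
The simplification behaviour described above can be reproduced in base R without the package. The following is a minimal sketch, not the package's code: the two screeners and the names `screeners`, `multi`, `single`, and `stacked` are made up for illustration, and the `do.call(rbind, lapply(...))` form is just one way to keep the orientation stable, not necessarily what this pull request does.

```r
# Hypothetical screeners (not from SuperLearner): each returns one logical
# per column of X, mimicking what a screening algorithm reports.
screeners <- list(
  All   = function(X) rep(TRUE, ncol(X)),
  first = function(X) seq_len(ncol(X)) == 1
)

set.seed(1)
X5 <- as.data.frame(matrix(rnorm(50), ncol = 5))
X1 <- X5[, 1, drop = FALSE]  # single-column data frame

# Five columns: sapply() simplifies to a 5 x 2 matrix, so t() gives
# one row per screener and one column per variable.
multi <- t(sapply(screeners, function(f) f(X5)))
dim(multi)    # 2 5

# One column: sapply() simplifies to a length-2 vector, and t() of a
# vector is a 1-row matrix, so the orientation is flipped.
single <- t(sapply(screeners, function(f) f(X1)))
dim(single)   # 1 2

# Stacking the per-screener results row by row does not depend on
# how many columns X has.
stacked <- do.call(rbind, lapply(screeners, function(f) f(X1)))
dim(stacked)  # 2 1
```

With a single screener the flipped result is 1 x 1, so the flip changes only the dimnames and not the shape, which is consistent with the single-screener case running without an error.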

ck37 commented 6 years ago

Thanks, Sara. Do you think this could have a quick unit test to confirm that these updates work for a single-column data frame?

saraemoore commented 6 years ago

I had a feeling you might request that. :) Do you want unit tests for all affected functions or will one just for SuperLearner() suffice?

ck37 commented 6 years ago

It would be best to test each function; that way we confirm that they work and won't be unknowingly broken for this data scenario in the future.

ecpolley commented 6 years ago

Thanks! Do you have an example where SuperLearner fails? Is it only the case when the 'SL.library' includes a screening algorithm but the data.frame has only a single variable (which should raise other concerns, but maybe should not fail the entire SL)? The following works:

library(SuperLearner)

set.seed(23432)
## training set
n <- 500
p <- 5
X <- matrix(rnorm(n*p), nrow = n, ncol = p)
colnames(X) <- paste("X", 1:p, sep="")
X <- data.frame(X)
Y <- X[, 1] + sqrt(abs(X[, 2] * X[, 3])) + X[, 2] - X[, 3] + rnorm(n)

df_X <- data.frame(X1 = X[, 1]) # only single variable in the data frame
# generate Library and run Super Learner
SL.library <- c("SL.glm", "SL.randomForest", "SL.gam", "SL.polymars", "SL.mean")
test <- SuperLearner(Y = Y, X = df_X, SL.library = SL.library)
test

saraemoore commented 6 years ago

That's an oversight on my part -- I was only testing it in the case where multiple screening algorithms were specified. So, to be more specific, the failure occurs when two or more screening algorithms are used and X is single-column, like so:

library(SuperLearner)
set.seed(23432)
n <- 500
p <- 5
X <- matrix(rnorm(n*p), nrow = n, ncol = p)
colnames(X) <- paste("X", 1:p, sep="")
X <- data.frame(X)
Y <- X[, 1] + sqrt(abs(X[, 2] * X[, 3])) + X[, 2] - X[, 3] + rnorm(n)
SL.library <- list(c("SL.glm", "screen.corRank"), c("SL.randomForest", "All"))
test <- SuperLearner(Y = Y, X = X[, 1, drop = FALSE], SL.library = SL.library)

I understand that this isn't a case that anyone should normally run into -- mine is a somewhat specialized case where some variable selection happens ahead of time. To be fair, though, unless all columns of X are dropped, there's no reason it should fail. Also, in the case of a single-column X and a single screening algorithm, SuperLearner only works 'by accident' -- whichScreen should have one row per screener and one column per variable, but in this case, whichScreen is transposed:

library(SuperLearner)
set.seed(23432)
n <- 500
p <- 5
X <- matrix(rnorm(n*p), nrow = n, ncol = p)
colnames(X) <- paste("X", 1:p, sep="")
X <- data.frame(X)
Y <- X[, 1] + sqrt(abs(X[, 2] * X[, 3])) + X[, 2] - X[, 3] + rnorm(n)
SL.library <- c("SL.glm", "SL.randomForest", "SL.gam","SL.polymars", "SL.mean")
test <- SuperLearner(Y = Y, X = X[, 1, drop = FALSE], SL.library = SL.library)
test$whichScreen
##       All
## [1,] TRUE

This pull request fixes that issue:

devtools::install_github("saraemoore/SuperLearner")
library(SuperLearner)
set.seed(23432)
n <- 500
p <- 5
X <- matrix(rnorm(n*p), nrow = n, ncol = p)
colnames(X) <- paste("X", 1:p, sep="")
X <- data.frame(X)
Y <- X[, 1] + sqrt(abs(X[, 2] * X[, 3])) + X[, 2] - X[, 3] + rnorm(n)
SL.library <- c("SL.glm", "SL.randomForest", "SL.gam","SL.polymars", "SL.mean")
test <- SuperLearner(Y = Y, X = X[, 1, drop = FALSE], SL.library = SL.library)
test$whichScreen
##     [,1]
## All TRUE

saraemoore commented 6 years ago

Sorry -- busy week and didn't have a chance to write the tests, but can still take care of them over the weekend if you'd like. Let me know.

ecpolley commented 6 years ago

@saraemoore I've always thought of this as an odd edge case (a single variable in X and at least one screening algorithm for the variables). This error had come up in the past, but I'd previously decided not to fix it because it represents an ill-posed SL.library. I don't think we need a test directly for this case. What might help is some logical warning messages about the algorithms in the library relative to the data. For example, we could have a warning message if the user provides a screening algorithm but only has a single variable in X.
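
A warning of that kind could look roughly like the sketch below. It is illustrative only: `check_single_column_screen()` is a hypothetical helper, not a function in SuperLearner, and it assumes the list form of `SL.library` used earlier in this thread, where each element is a character vector with the learner first and any screeners after it, and "All" means no screening.

```r
# Hypothetical helper (not part of SuperLearner): warn when screening
# algorithms are requested but X has only one column.
check_single_column_screen <- function(X, SL.library) {
  # Everything after the first entry of each library element is a screener.
  screens <- unique(unlist(lapply(SL.library, function(x) x[-1])))
  # "All" keeps every variable, so it is not counted as a screener here.
  screens <- setdiff(screens, "All")
  if (ncol(X) == 1 && length(screens) > 0) {
    warning("X has a single column but SL.library includes screening algorithm(s): ",
            paste(screens, collapse = ", "))
    return(invisible(TRUE))
  }
  invisible(FALSE)
}

# Usage: a single-column X with one real screener triggers the warning.
X1  <- data.frame(X1 = rnorm(10))
lib <- list(c("SL.glm", "screen.corRank"), c("SL.randomForest", "All"))
tryCatch(
  check_single_column_screen(X1, lib),
  warning = function(w) message("warning: ", conditionMessage(w))
)
```

Returning a logical invisibly keeps the check easy to test while leaving the decision to continue or stop with the caller.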

saraemoore commented 6 years ago

Sounds good. I submitted a PR that adds warnings for this particular case. I now see what you mean about it coming up in the past (i.e. @lendle reported this issue a few years ago).