ecpolley / SuperLearner

Current version of the SuperLearner R package

predict.SuperLearner throwing errors #122

Closed hanson1005 closed 5 years ago

hanson1005 commented 5 years ago

Hi. I am training a model on a small sample of data and then trying to predict values for a very large out-of-sample data set using "predict.SuperLearner". However, it consistently fails with this error:

"Error in object$whichScreen : $ operator is invalid for atomic vectors"

Here is my code:

sl <- SuperLearner(Y = label, X = train, SL.library = c("SL.randomForest", "SL.glmnet", "SL.svm"), method = "method.NNLS", verbose = TRUE)

pred.sl <- predict.SuperLearner(sl, newdata=test, onlySL = T)

Even when I pass X and Y in the predict call, I still get the same error. What does this error mean, and how can I solve it? Does it have to do with the fact that no screening algorithms are selected in SL.library?

As a workaround, I have been using the prediction built into the SuperLearner call by supplying the new data in the call itself. However, since the out-of-sample data is far too big, I have to subset the data, fit the same model multiple times, and then combine the output. It works this way, but it is painful, especially because fitting one model takes a very long time.
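One alternative to refitting for every subset is to fit once and call predict on the new data in chunks. The sketch below is only an illustration under stated assumptions: `predict_in_chunks` is a hypothetical helper, not part of SuperLearner, and `lm` is used as a stand-in model so the example runs with base R alone. With a SuperLearner fit, each chunk's prediction is expected to be a list, so the `pred` component would need to be extracted and combined instead of calling `unlist` directly.

```r
# Hypothetical helper (not part of SuperLearner): predict on newdata in
# row chunks and combine the results. Assumes newdata has at least one row
# and that predict() returns a plain numeric vector for each chunk.
predict_in_chunks <- function(fit, newdata, chunk_size = 1000, ...) {
  starts <- seq(1, nrow(newdata), by = chunk_size)
  pieces <- lapply(starts, function(s) {
    rows <- s:min(s + chunk_size - 1, nrow(newdata))
    predict(fit, newdata = newdata[rows, , drop = FALSE], ...)
  })
  unlist(pieces, use.names = FALSE)
}

# Stand-in example with lm(); the model is fitted only once.
set.seed(1)
train <- data.frame(x = rnorm(100))
train$y <- 2 * train$x + rnorm(100)
test <- data.frame(x = rnorm(2500))

fit <- lm(y ~ x, data = train)
chunked <- predict_in_chunks(fit, test, chunk_size = 1000)
full <- predict(fit, newdata = test)
```

The chunked result should match a single predict call on the full test set, which is an easy sanity check before using the pattern on data too large to predict in one pass.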

Anyway, I want to know how to debug this issue. Thanks!

ecpolley commented 5 years ago

Do you think you can put together a reproducible example with the error? I tried the following and it works without error. A few things to check are the version of SuperLearner (I tested 2.0-24) and make sure the column names are the same between train and test: setdiff(colnames(train), colnames(test))

library(SuperLearner)
set.seed(23432)
## training set
n <- 500
p <- 50
X <- matrix(rnorm(n*p), nrow = n, ncol = p)
colnames(X) <- paste("X", 1:p, sep="")
X <- data.frame(X)
Y <- X[, 1] + sqrt(abs(X[, 2] * X[, 3])) + X[, 2] - X[, 3] + rnorm(n)
m <- 1000
newX <- matrix(rnorm(m*p), nrow = m, ncol = p)
colnames(newX) <- paste("X", 1:p, sep="")
newX <- data.frame(newX)
newY <- newX[, 1] + sqrt(abs(newX[, 2] * newX[, 3])) + newX[, 2] - newX[, 3] + rnorm(m)

sl <- SuperLearner(Y = Y, X = X, SL.library=c("SL.randomForest", "SL.glmnet", "SL.svm"), method = "method.NNLS", verbose=TRUE)
pred.sl <- predict.SuperLearner(sl, newdata=newX, onlySL = T)
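The column-name check mentioned above can be run in both directions, since `setdiff` only reports names missing from its second argument. The following is a minimal sketch using base R; `compare_columns` is a hypothetical helper name, and the two small data frames are made up for illustration.

```r
# Hypothetical helper (not part of SuperLearner): report column-name
# mismatches between a training and a test data frame.
compare_columns <- function(train, test) {
  list(
    only_in_train = setdiff(colnames(train), colnames(test)),
    only_in_test  = setdiff(colnames(test), colnames(train)),
    same_order    = identical(colnames(train), colnames(test))
  )
}

# Made-up data frames that disagree on columns "b" and "d".
train <- data.frame(a = 1:3, b = 4:6, c = 7:9)
test  <- data.frame(a = 1:2, c = 3:4, d = 5:6)

res <- compare_columns(train, test)
print(res)
```

Both `only_in_train` and `only_in_test` should be empty, and `same_order` should be TRUE, before the names can be ruled out as a cause.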

hanson1005 commented 5 years ago

Hi. Thanks for the quick response. I am using version 2.0-24, and the column names are the same between the train and test sets because they are subsets of the same data frame. Actually, I asked the same question about the same data around this time last year, and you confirmed that the predict command did not work for you with my data either. Could you please look into this?

I am sending the data that I pass as X, Y, and newX. The code I was using is as follows:

sl_samples <- SuperLearner(Y = label, X = train, SL.library=c("SL.randomForest"), method = "method.NNLS", verbose=TRUE)

pred <- predict.SuperLearner(sl_samples, newdata = test, X=train, Y=label)

Also, when I increase the number of variables in both the train and test sets and fit a SuperLearner model, I get an "undefined columns selected" error, which does not make sense to me because my data has no column containing only NAs or 0s. The exact error message is below:

CV SL.randomForest_All
Error in `[.data.frame`(x, r, vars, drop = drop) : undefined columns selected
In addition: Warning message:
In FUN(X[[i]], ...) : Error in algorithm SL.randomForest
  The Algorithm will be removed from the Super Learner (i.e. given weight 0)

Can you please take a look and see why I am getting this message? Should I open a separate issue on GitHub for it? Thank you!

Best,

Julia

ecpolley commented 5 years ago

Thanks Julia, we can keep the discussion in this thread. I wonder if this is related to an invalid variable name. One thing you might try is prior to running the SuperLearner(), try cleaning up the variable names in the full data frame.

colnames(X) <- make.names(colnames(X), unique = TRUE)
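As a quick illustration of what this call does, the sketch below applies `make.names` to a few made-up column names using base R only. It repairs syntactically invalid names and de-duplicates repeats, but a syntactically valid name such as "object" passes through unchanged, which is consistent with the report further down that this step did not resolve the error.

```r
# Made-up column names: one with a space, one starting with a digit,
# a duplicated pair, and one that is already a valid R name.
raw_names <- c("my var", "1st", "ok", "ok", "object")

clean_names <- make.names(raw_names, unique = TRUE)
print(clean_names)
# "my.var" "X1st" "ok" "ok.1" "object"
```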

hanson1005 commented 5 years ago

Hi. I confirmed that the variable names in my data are unique. Even with the code you provided, the predict.SuperLearner function still does not work. Have you had a chance to check the data I sent you? I would appreciate it if you could resolve this problem with the "predict.SuperLearner" function and its error message.

Thank you!!

ecpolley commented 5 years ago

Thanks for trying the suggestion. Can you email me a link to the data again? I found the previous one but it is no longer valid.

hanson1005 commented 5 years ago

Below I am attaching the files I used. Thanks!

ecpolley commented 5 years ago

OK, I don't have a code fix yet, but the problem appears to be caused by using 'object' as a column name. Here is a reproducible example of the error:

library(SuperLearner)
set.seed(23432)
## training set
n <- 500
p <- 50
X <- matrix(rnorm(n*p), nrow = n, ncol = p)
colnames(X) <- paste("X", 1:p, sep="")
colnames(X)[1] <- 'object'
X <- data.frame(X)
Y <- X[, 1] + sqrt(abs(X[, 2] * X[, 3])) + X[, 2] - X[, 3] + rnorm(n)
m <- 1000
newX <- matrix(rnorm(m*p), nrow = m, ncol = p)
colnames(newX) <- paste("X", 1:p, sep="")
colnames(newX)[1] <- 'object'
newX <- data.frame(newX)
newY <- newX[, 1] + sqrt(abs(newX[, 2] * newX[, 3])) + newX[, 2] - newX[, 3] + rnorm(m)

sl <- SuperLearner(Y = Y, X = X, SL.library=c("SL.randomForest", "SL.glmnet", "SL.svm"), method = "method.NNLS", verbose=TRUE)
pred.sl <- predict.SuperLearner(sl, newdata=newX, onlySL = T)

As a temporary fix, rename the column to something other than 'object' in both the train and test data frames.

ecpolley commented 5 years ago

For example, try

colnames(train) <- sub("object", "Xobject", colnames(train))
colnames(test) <- sub("object", "Xobject", colnames(test))
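Note that `sub("object", ...)` matches the pattern anywhere in a name, so a column such as "objective" would also be renamed (to "Xobjective"). That is harmless as long as train and test are renamed the same way, but if only the exact name should change, an exact comparison can be used instead. The sketch below is one possible variant in base R; `rename_exact` is a hypothetical helper name and the data frame is made up.

```r
# Hypothetical helper: rename only the column whose name is exactly `from`.
rename_exact <- function(df, from = "object", to = "Xobject") {
  colnames(df)[colnames(df) == from] <- to
  df
}

# Made-up data frame with an exact match and a partial match.
train_demo <- data.frame(object = 1:3, objective = 4:6, X3 = 7:9)
train_fixed <- rename_exact(train_demo)
print(colnames(train_fixed))
# "Xobject" "objective" "X3"
```

The same call would need to be applied to the test data frame so that both keep identical column names.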