Prediction problems with missing factor levels

timolingh commented 8 years ago

I'm having a problem using a glmboost object to predict values for a new dataset. The new dataset has one fewer level in one of its categorical variables. I think this should still work; calling predict on a vanilla 'lm' object does work. Can you advise?

rm(list = ls())

library(mboost)
library(data.table)

## A data table with continuous x, y, binary z, and categorical q
set.seed(8888)
foo <- data.table(x = rnorm(10000, 2, 1), y = rnorm(10000, 3, 3), z = rbinom(10000, 1, 0.2),
                  q = sample(c("a", "b", "c", "d", "e"), size = 10000, replace = TRUE))

## Generate dependent variable (u) with noise. Also has interaction terms
foo[, u := 22 * x + 34 * y + 1 * (z * x) + sapply(q, switch, a = 0.8, b = 1.5, c = 0.5, 0.8) + rnorm(10000, 0, 5)]

foo[, `:=`(q = as.factor(q), z = as.factor(z))]

## New dataset for prediction.  Note that q is missing level "e"
bar <- data.table(x = rnorm(10000, 2, 1), y = rnorm(10000, 3, 3), z = rbinom(10000, 1, 0.2),
                  q = sample(c("a", "b", "c", "d"), size = 10000, replace = TRUE))
bar[, `:=`(q = as.factor(q), z = as.factor(z))]

## Model spec - has interaction terms with z
fm <- as.formula("u ~ (x + y + q) * z")

## Fit model with base 'lm' function
summary(lm1 <- lm(fm, data = foo))

## Prediction works
lm1.predict <- predict(lm1, newdata = bar)

## Now fit a boosted model
summary(lm2 <- glmboost(fm, data = foo, control = boost_control(mstop = 1000, nu = 0.1)))

## The predict call errors with "Error in scale.default(X, center = cm, scale = FALSE) : length of 'center' must equal the number of columns of 'x'"
coef(lm2[104], which = "")
lm2.pred <- predict(lm2, newdata = droplevels(bar))

mayrandy commented 8 years ago

Hi! Well, as you said, the problem is simply that factor level "e" does not exist in your second dataset (somewhat similar to #47).
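To illustrate the mechanism, here is a minimal base-R sketch with toy data (the objects `train`, `new`, `X_train`, and `X_new` are made up for this example and are not part of mboost). It assumes that the prediction step builds a design matrix from the new data and centers it with column means stored at fitting time, which is what the `scale.default` call in the error message suggests; the actual mboost internals may differ.

```r
# Toy data (hypothetical): the training factor has five levels, the new one four
train <- data.frame(q = factor(c("a", "b", "c", "d", "e")))
new   <- data.frame(q = factor(c("a", "b", "c", "d")))

# Dummy coding yields one column per non-reference level, plus the intercept
X_train <- model.matrix(~ q, data = train)
X_new   <- model.matrix(~ q, data = new)
n_train <- ncol(X_train)
n_new   <- ncol(X_new)

# Centering the new matrix with the training column means fails because
# the number of means no longer matches the number of columns
msg <- tryCatch(
  {
    scale(X_new, center = colMeans(X_train), scale = FALSE)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
msg
```

The missing level removes one dummy column from the new design matrix, so anything stored per column at fitting time no longer lines up with it.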

The behavior of mboost in this case is not very user-friendly, and the error message unfortunately does not help much. Anyhow, if you just need those predictions, a solution (as you might know anyway) is of course:

# set equal levels 
levels(bar$q) <- levels(foo$q)

# omitting droplevels
lm2.pred <- predict(lm2, newdata = bar)
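One caveat worth noting: assigning `levels<-` matches levels by position, not by label. It works in this thread because the missing level "e" sorts last. If a level in the middle were missing, the remaining values would be silently relabelled. A base-R sketch with toy vectors (`train_q` and `new_q` are hypothetical stand-ins for `foo$q` and `bar$q`) shows the difference and a label-based alternative:

```r
# Toy stand-ins for the training and prediction factors (hypothetical data)
train_q <- factor(c("a", "b", "c", "d", "e"))
new_q   <- factor(c("a", "b", "d", "e"))   # level "c" is missing

# Positional assignment relabels: the old "d" becomes "c", the old "e" becomes "d"
shifted <- new_q
levels(shifted) <- levels(train_q)
as.character(shifted)

# Matching by label keeps every value and only adds the unused level
aligned <- factor(new_q, levels = levels(train_q))
as.character(aligned)
levels(aligned)
```

Applied to the data in this thread, the label-based form would be `bar$q <- factor(bar$q, levels = levels(foo$q))`, again without calling `droplevels` before `predict`.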

timolingh commented 8 years ago

Thank you. Somehow after all this time, I did not know that.

boost-R / mboost

Prediction problems with missing factor levels #57