bols for factors with unobserved levels breaks

hofnerb commented 8 years ago

library("mboost")
z <- factor(sample(1:5, 100, replace = TRUE), levels = 1:6)
y <- rnorm(100)
mboost(y ~ bols(z))
## Error in solve.default(XtX, crossprod(X, y), LINPACK = FALSE) : 
##  Lapack routine dgesv: system is exactly singular: U[6,6] = 0

z <- droplevels(z)
mboost(y ~ bols(z)) # works

Thus, perhaps we should use droplevels() within bols and issue a warning if any levels are dropped.
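
A minimal sketch of that idea in base R, outside of mboost. The helper name drop_unused_levels() is hypothetical and not part of the package; it only illustrates dropping unobserved levels with a warning before the design matrix is built.

```r
## Hypothetical helper (not part of mboost): drop unobserved factor levels
## and warn about every level that was removed.
drop_unused_levels <- function(x) {
  stopifnot(is.factor(x))
  unused <- setdiff(levels(x), unique(as.character(x)))
  if (length(unused) > 0) {
    warning("dropping unobserved factor level(s): ",
            paste(unused, collapse = ", "))
    x <- droplevels(x)
  }
  x
}

## Deterministic example: levels 1 to 5 are observed, level 6 is not
z <- factor(rep(1:5, 20), levels = 1:6)
z_clean <- suppressWarnings(drop_unused_levels(z))
nlevels(z_clean)  # 5
```

Whether bols() should do this silently, with a warning, or not at all is exactly the design question raised here.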

sbrockhaus commented 8 years ago

Reading your issue, I wondered what happens within cvrisk() when a factor level is empty in one of the folds. I constructed the following example:

set.seed(123)
z <- factor(sample(1:5, 100, replace = TRUE), levels = 1:6)
y <- rnorm(100)
m <- mboost(y ~ bols(z))

## Create resampling folds 
myfolds <- cv(model.weights(m), "kfold")

# In the first fold, set the weights of all observations with factor level 1 to 0;
# thus, this factor level is empty in this fold
myfolds[z == 1, 1] <- 0

## cvrisk does not work for first fold
cv1 <- cvrisk(m, folds = myfolds)

## fit the model of the first fold by hand 
## works fine by dropping factor level
y_fold1 <- y[myfolds[ ,1] == 1]
z_fold1 <- z[myfolds[ ,1] == 1]
m_fold1 <- mboost(y_fold1 ~ bols(z_fold1))

## try to fit the same model using weights, breaks with error
m_fold1 <- mboost(y ~ bols(z), weights = myfolds[ , 1])

## Error in solve.default(XtX, crossprod(X, y), LINPACK = FALSE) : 
## system is computationally singular: reciprocal condition number = 2.43337e-18
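
A base-R sketch of why zero weights lead to a singular system. It assumes a plain dummy-coded design matrix from model.matrix() as a stand-in; the matrix that bols() builds internally may differ. With all weights of one level set to 0, the weighted cross product loses rank, whereas subsetting and dropping the level does not.

```r
## Toy data: three levels, four observations each
z <- factor(rep(1:3, each = 4))
w <- as.numeric(z != "1")          # zero weight for every observation of level 1

## Weighted cross product X'WX with the full set of levels
X <- model.matrix(~ z)
XtWX <- crossprod(X * w, X)
rank_weighted <- qr(XtWX)$rank     # smaller than ncol(X): singular

## Subsetting instead of weighting, with the empty level dropped
z_sub <- droplevels(z[w > 0])
X_sub <- model.matrix(~ z_sub)
rank_subset <- qr(crossprod(X_sub))$rank   # equals ncol(X_sub): full rank
```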

Do you think this is a problem?

sbrockhaus commented 8 years ago

@davidruegamer does this change in the mboost package affect your resampling using brandom()?

davidruegamer commented 8 years ago

@sbrockhaus do you mean the bootstrapped "confidence intervals" for which resampling is done on the subject level? I actually applied droplevels() by hand, and as I do not have to validate each sample (I only extract coefficients), there should be no problem.

hofnerb commented 8 years ago

We modified cvrisk() such that it doesn't break if single folds break. This was considered reasonable, as the remaining folds should usually be sufficient. I see no problem when we drop empty levels and cvrisk() is used to estimate the optimal stopping iteration. On the contrary, results are now based on more folds and are thus more representative.
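
The general pattern of skipping failing folds can be sketched as follows. This is not the cvrisk() source; fit_fold() is a hypothetical weighted least-squares fit that only serves to show how tryCatch() keeps one failing fold from aborting the whole resampling run.

```r
## Hypothetical per-fold fit (not mboost code): weighted least squares,
## which throws an error if X'WX is singular
fit_fold <- function(w, X, y) {
  XtWX <- crossprod(X * w, X)
  solve(XtWX, crossprod(X * w, y))
}

z <- factor(rep(1:3, each = 4))
y <- seq_along(z) / 10
X <- model.matrix(~ z)
folds <- cbind(fold1 = as.numeric(z != "1"),    # level 1 is empty: fit fails
               fold2 = rep(c(1, 1, 1, 0), 3))   # all levels present

## Fit every fold; a failing fold yields NULL instead of stopping the loop
fits <- lapply(seq_len(ncol(folds)), function(k) {
  tryCatch(fit_fold(folds[, k], X, y),
           error = function(e) {
             message("fold ", k, " failed and is skipped: ", conditionMessage(e))
             NULL
           })
})
ok <- !vapply(fits, is.null, logical(1))   # which folds can be used
```

Any summary over folds would then be computed from fits[ok] only.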

Regarding confidence intervals:

Do you know that there is a function implementing this? See ?confint.mboost. The function was described in B. Hofner, T. Kneib, T. Hothorn (2016). "A Unified Framework of Constrained Regression". Statistics and Computing. 26:1-14. DOI 10.1007/s11222-014-9520-y
If you construct CIs for factor variables, droplevels() might be a problem; yet not using droplevels() is a problem as well. The question is: What does it actually mean if a level was dropped? Is it equivalent to the level being estimated as zero? As I use predictions, this should be manageable somehow. What were your considerations, @davidruegamer?

Currently, the following code breaks:

### check confidence intervals for factors with very small level frequencies
z <- factor(c(sample(1:5, 100, replace = TRUE), 6), levels = 1:6)
y <- rnorm(101)
mod <- mboost(y ~ bols(z))
confint(mod)

hofnerb commented 8 years ago

I moved this to a new issue as it touches a similar yet distinct problem. The original issue was solved with the update.

boost-R / mboost

bols for factors with unobserved levels breaks #47