Closed hofnerb closed 8 years ago
Reading your issue, I just wondered what happens within cvrisk() when in a fold a factor level is empty. I constructed the following example:
library("mboost")
set.seed(123)
z <- factor(sample(1:5, 100, replace = TRUE), levels = 1:6)
y <- rnorm(100)
m <- mboost(y ~ bols(z))
## Create resampling folds
myfolds <- cv(model.weights(m), "kfold")
# In the first fold, set all observations with factor level 1 to 0
# thus, in this fold this factor level is empty
myfolds[ z == 1 , 1] <- 0
## cvrisk does not work for first fold
cv1 <- cvrisk(m, folds = myfolds)
## fit the model of the first fold by hand
## works fine by dropping factor level
y_fold1 <- y[myfolds[ ,1] == 1]
z_fold1 <- z[myfolds[ ,1] == 1]
m_fold1 <- mboost(y_fold1 ~ bols(z_fold1))
## try to fit the same model using weights, breaks with error
m_fold1 <- mboost(y ~ bols(z), weights = myfolds[ , 1])
## Error in solve.default(XtX, crossprod(X, y), LINPACK = FALSE) :
## system is computationally singular: reciprocal condition number = 2.43337e-18
Do you think this is a problem?
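To illustrate why the weighted fit fails, here is a minimal base-R sketch on toy data. It assumes plain dummy coding of the factor, which is an assumption for illustration only; bols() may code the design matrix differently. Once all observations of one level have weight zero, the corresponding row and column of the weighted cross-product are zero, so the matrix is singular.

```r
## Toy data: three levels, two observations each
z <- factor(c(1, 1, 2, 2, 3, 3), levels = 1:3)

## Plain dummy coding (assumption; not necessarily what bols() uses)
X <- model.matrix(~ z - 1)

## Weights as in a fold where level 1 is empty
w <- as.numeric(z != "1")

## Weighted cross-product X'WX; multiplying X by w scales row i by w[i]
XtWX <- crossprod(X * w, X)

## The rank is smaller than the number of columns, i.e. XtWX is singular
rank_XtWX <- qr(XtWX)$rank
is_singular <- rank_XtWX < ncol(X)
```

Dropping the empty level before building the design matrix removes the zero column, which is why the fit "by hand" above works.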
@davidruegamer does this change in the mboost package affect your resampling using brandom()?
@sbrockhaus you mean the bootstrapped "confidence intervals" for which resampling is done on the subject level? I actually did the droplevels by hand, and as I do not have to validate each sample (I just extract the coefficients), there should be no problem.
We modified cvrisk() such that it doesn't break if single folds break. This was considered reasonable, as the remaining folds should usually be sufficient. I see no problem when we drop empty levels and cvrisk is used to estimate the optimal stopping iteration. On the contrary, results are now based on more folds and are thus more representative.
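The idea of skipping folds that fail can be sketched in base R as follows. This is a hypothetical illustration, not the code used inside cvrisk(); the names fit_folds and fit_fun are made up for this example.

```r
## Hypothetical sketch (not mboost code): apply a fitting function to each
## column of a fold matrix and skip folds in which the fit raises an error.
fit_folds <- function(folds, fit_fun) {
    res <- lapply(seq_len(ncol(folds)), function(k) {
        tryCatch(
            fit_fun(folds[, k]),
            error = function(e) {
                ## report the failing fold, then mark it as missing
                warning("Fold ", k, " failed: ", conditionMessage(e))
                NULL
            }
        )
    })
    ## keep only the folds that could be fitted
    res[!vapply(res, is.null, logical(1))]
}
```

Note that in this sketch a fit_fun that legitimately returns NULL would also be dropped.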
Regarding confidence intervals: see ?confint.mboost. The function was described in B. Hofner, T. Kneib, T. Hothorn (2016). "A Unified Framework of Constrained Regression". Statistics and Computing. 26:1-14. DOI 10.1007/s11222-014-9520-y
droplevels might be a problem, yet not using droplevels is a problem as well. The question is: What does it actually mean if a level was dropped? Is it equal to the level being estimated as zero? As I use predictions, this should be somehow manageable. What were your considerations, @davidruegamer?
Currently, the following code breaks:
### check confidence intervals for factors with very small level frequencies
z <- factor(c(sample(1:5, 100, replace = TRUE), 6), levels = 1:6)
y <- rnorm(101)
mod <- mboost(y ~ bols(z))
confint(mod)
I moved this to a new issue as it touches a similar yet distinct problem. The original issue was solved with the update.
Thus, perhaps we should use droplevels() within bols and issue a warning if any levels are dropped.
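The suggestion above could be sketched roughly as follows in base R. The helper drop_unweighted_levels is hypothetical and not part of mboost; it only shows the "drop empty levels and warn" step, assuming that a level counts as empty when no observation with positive weight belongs to it.

```r
## Hypothetical helper (not part of mboost): drop factor levels that have
## no observation with positive weight and warn about the dropped levels.
drop_unweighted_levels <- function(x, weights = rep(1, length(x))) {
    stopifnot(is.factor(x), length(x) == length(weights))
    ## levels that occur among observations with positive weight
    used <- levels(droplevels(x[weights > 0]))
    dropped <- setdiff(levels(x), used)
    if (length(dropped) > 0)
        warning("Dropping empty factor level(s): ",
                paste(dropped, collapse = ", "))
    ## observations belonging to a dropped level become NA
    factor(x, levels = used)
}

## Usage: level "b" only occurs with weight 0, level "d" never occurs
x <- factor(c("a", "b", "a", "c"), levels = c("a", "b", "c", "d"))
w <- c(1, 0, 1, 1)
x_used <- suppressWarnings(drop_unweighted_levels(x, w))
```

Zero-weight observations of a dropped level turn into NA here, so a real implementation would still have to decide how to handle them, for example when predicting for the held-out observations.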