boost-R / mboost

Boosting algorithms for fitting generalized linear, additive and interaction models to potentially high-dimensional data. The current release version can be found on CRAN (http://cran.r-project.org/package=mboost).

confint with sparse factor levels breaks #49

Open hofnerb opened 8 years ago

hofnerb commented 8 years ago

see #47

davidruegamer commented 8 years ago

Regarding your comment in #47 : My use case is only partly related to mboost, as we calculated bootstrap intervals for functional effects fitted via FDboost, and using confint would actually have been too complicated.

If you construct CIs for factor variables, droplevels might be a problem; yet not using droplevels is a problem as well. The question is: what does it actually mean if a level was dropped? Is it equivalent to the level being estimated as zero? As I use predictions, this should be somehow manageable. What were your considerations [...]?
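To make the droplevels question concrete, here is a minimal base-R sketch (all variable names are illustrative, not mboost internals) of how a bootstrap resample can silently lose a sparse factor level:

```r
## A factor with one sparse level ("6" occurs exactly once).
set.seed(1)
z <- factor(c(sample(1:5, 100, replace = TRUE), 6), levels = 1:6)

## One bootstrap fold: the single row with level "6" may not be drawn.
idx <- sample(seq_along(z), replace = TRUE)
z_boot <- z[idx]

table(z_boot)                 # level "6" may have count 0 in this fold
nlevels(z_boot)               # still 6: empty levels are kept by subsetting ...
nlevels(droplevels(z_boot))   # ... but droplevels() removes them, so the
                              # design matrix for this fold loses a column
```

Whether the dropped column should be treated as "effect estimated as zero" or as "effect not estimable in this fold" is exactly the ambiguity discussed above.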

In my case, resampling was done on the level of correlated observations, i.e., on the subject level, with each subject having gone through every other possible study setting. So I actually did not have to deal with sparse factor levels (and dropping levels should be fine for random effects?).

But in general, if unfilled categories occur in far more than one sample, I would throw an error. In that case, I would say, the problem falls back to insufficiently informative data for mimicking the true distribution $F_{Y,X}$ and is therefore not a problem of mboost. I'm not quite sure what exactly happens in confint at the moment when there are unfilled levels, but if there are just a handful of samples with unfilled categories, I still would not set the estimate to zero. I would rather change the behaviour of .ci_mboost to calculate the intervals for this specific factor (level) only on the basis of those samples in which all levels of the factor variable are present.
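The suggestion above (compute the interval for a factor level only from folds in which all levels were observed) can be sketched in base R; note that the per-fold estimates here are simulated placeholders, not actual .ci_mboost output:

```r
## Hedged sketch: restrict a percentile bootstrap CI to "complete" folds.
set.seed(2)
B <- 200
z <- factor(c(sample(1:5, 100, replace = TRUE), 6), levels = 1:6)

## Draw B bootstrap folds and flag those containing every level of z.
folds <- replicate(B, sample(seq_along(z), replace = TRUE))
complete <- apply(folds, 2, function(idx) all(levels(z) %in% z[idx]))

## Placeholder per-fold estimates for the sparse level (illustration only;
## in practice these would come from refitting the model on each fold).
boot_coefs <- rnorm(B)

## Percentile CI based only on complete folds.
ci <- quantile(boot_coefs[complete], c(0.025, 0.975))
mean(complete)   # share of usable folds; a small share signals sparse data
```

A very small share of complete folds would then be the natural trigger for the "insufficiently informative data" error mentioned above.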

hofnerb commented 8 years ago

Just to understand you correctly: You were computing CIs for fixed effects and were not interested in the random effects? In that case I would agree that dropping unused levels should not pose any problem.

Regarding the second part of your answer I have to rethink this. In a parametric setting, the CI would get rather big in that case as the standard error gets large.

With setting the estimate to zero I meant only the estimate on the current fold, which then becomes the basis for the CI. However, you are right that this isn't correct either. Currently the code just breaks. Perhaps we keep this behavior and simply throw a more informative error to let the user know that sparse categories hamper the computation of bootstrap CIs. Well, I have to check this in a small simulation...

davidruegamer commented 8 years ago

Just to understand you correctly: You were computing CIs for fixed effects and were not interested in the random effects? In that case I would agree that dropping unused levels should not pose any problem.

Yes, exactly. Thanks for the response!

With setting the estimate to zero I meant only the estimate on the current fold which then becomes the basis for the CI.

So did I. But I think precisely this procedure is problematic. For example, think about a model for the probability of suffering a stroke (Yes / No). If there is a factor variable "suffered_stroke_before", which is FALSE for most observations but highly predictive of "Yes" when TRUE, you certainly do not want to set the effect to zero for a large number of folds (though the corresponding confidence interval would probably just touch, and not cross, the value zero).

Perhaps we keep this behavior and simply throw a more informative error to let the user know that sparse categories hamper the computation of bootstrap CIs.

It's probably for the best. I would even go so far as to say that CIs based on bootstrapped (shrunken) boosting coefficients are a feature for advanced users (who are aware of the origin of those intervals), and throwing an error is in line with the actual purpose of the function (rather an "I'm aware that the intervals do not necessarily comply with the nominal level and are biased due to the shrinkage" function than a black-box interval function that always returns something).
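The informative-error idea could look roughly like this; `check_levels` is a hypothetical helper, not part of the mboost API:

```r
## Hedged sketch: before computing bootstrap CIs, verify that a fold
## contains every level of a factor, and stop with a clear message if not.
check_levels <- function(x, idx) {
  missing <- setdiff(levels(x), as.character(unique(x[idx])))
  if (length(missing))
    stop("Bootstrap fold misses factor level(s): ",
         paste(missing, collapse = ", "),
         "; sparse categories hamper the computation of bootstrap CIs.",
         call. = FALSE)
  invisible(TRUE)
}

z <- factor(c(rep(1:5, 20), 6), levels = 1:6)   # level "6" only in row 101
check_levels(z, seq_along(z))   # all levels present: passes silently
try(check_levels(z, 1:100))     # fold without row 101: informative error
```

Such a check would make the failure mode explicit instead of the current uninformative break.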

hofnerb commented 7 years ago

Start for test:

### check confidence intervals for factors with very small level frequencies
library("mboost")
set.seed(1907)  # make the sparse level reproducible
z <- factor(c(sample(1:5, 100, replace = TRUE), 6), levels = 1:6)
y <- rnorm(101)
mod <- mboost(y ~ bols(z))
confint(mod)

(to be added to tests/regtest-inference.R)