boost-R / mboost

Boosting algorithms for fitting generalized linear, additive and interaction models to potentially high-dimensional data. The current release version can be found on CRAN (http://cran.r-project.org/package=mboost).

Plotting variable selection frequency and coefficients #52

Closed timolingh closed 8 years ago

timolingh commented 8 years ago

Hi, I need to display the variable selection frequency and the coefficients in a single plot. I ended up writing my own code (maybe unnecessarily) to do this, but it would be good to have this available as a standard function. Here is the code that generated the attached plot:

Rplot.pdf

library(data.table)

modelSummary <- function(mdl) {

  # Get the variable names and the index of the variable selected in each iteration
  lm.var.index <- mdl$basemodel[[1]]$Xnames
  x.select <- mdl$xselect()
  x.select.readable <- sapply(x.select, function(x) lm.var.index[x])

  # Get the number of boosting iterations in the model
  stop.value <- mdl$control$mstop

  # The variable importance based on selection frequency
  sel <- x.select.readable
  var.imp <- data.table(sel)[, .(freq = .N / stop.value), by = sel]

  # Add the coefficients of the selected variables
  x.coef <- coef(mdl)[var.imp[, sel]]
  var.coef <- data.table(sel = names(x.coef), x.coef)

  # Merge the two data tables into a variable summary
  setkey(var.imp, sel)
  setkey(var.coef, sel)
  var.sum <- var.coef[var.imp]

  # Return the summary visibly
  var.sum
}

# mdl is a fitted mboost object; impact.mdl is the resulting summary table
impact.mdl <- modelSummary(mdl)

# renames the variables
var.readable <- sapply(1:27, as.character)
impact.mdl[, var.readable := var.readable]

# uses ggplot2; scale_color_stata() comes from ggthemes
library(ggplot2)
library(ggthemes)
ggplot(impact.mdl, aes(reorder(var.readable, freq), freq)) +
  geom_bar(stat = "identity") +
  geom_label(aes(label = sprintf("%.3f", x.coef), color = (x.coef < 0)), show.legend = FALSE) +
  labs(x = "Variable", y = "Selection frequency") +
  coord_flip() +
  scale_color_stata()

hofnerb commented 8 years ago

Thanks a lot for the code.

Admittedly, the results look very nice. However, I consider this to be a very specific problem. Usually, it should be sufficient to have separate displays of coefficients and variable importance (see PR29 for a way to plot VarImp). Furthermore, we currently do not use ggplot and I do not plan to use it in this package, as it adds a lot of dependencies. Finally, the code will only work for glmboost models and thus excludes all tree-based models and gamboost models.

Please note that you should never access slots of the fitted object directly but should instead use the provided functions such as:

selected(mdl)
variable.names(mdl, which = )
variable.names(mdl, which = "") ## for all names
mstop(mdl) 

For a (rather) comprehensive list of methods and extractor functions please see ?mboost_methods.
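
As a rough illustration of how the extractor functions could replace the slot access, here is a minimal base-R sketch. The helper selection_summary() and its argument names are hypothetical and not part of mboost. It assumes that the indices from selected() refer to positions in the vector returned by variable.names(mdl, which = ""), and that names(coef(mdl)) uses the same names. That should hold for a glmboost fit with numeric covariates, but it has not been checked against a fitted model here.

```r
# Hypothetical helper (not part of mboost): builds a table of selection
# frequencies and coefficients from values returned by extractor functions.
selection_summary <- function(sel, var_names, m, coefs) {
  # sel:       integer vector, index of the variable chosen in each iteration
  # var_names: character vector of all variable names, in index order
  # m:         number of boosting iterations
  # coefs:     named numeric vector of coefficients
  counts <- tabulate(sel, nbins = length(var_names))
  out <- data.frame(variable = var_names,
                    freq = counts / m,
                    coef = unname(coefs[var_names]),
                    stringsAsFactors = FALSE)
  out <- out[out$freq > 0, ]                  # keep selected variables only
  out[order(out$freq, decreasing = TRUE), ]   # most frequently selected first
}

# Toy input standing in for a fitted model (made-up values)
toy <- selection_summary(sel = c(1, 2, 2, 3, 2),
                         var_names = c("(Intercept)", "age", "bmi", "height"),
                         m = 5,
                         coefs = c("(Intercept)" = 0.5, age = 1.2, bmi = -0.3))
print(toy)

# With a fitted glmboost model `mdl`, the call would presumably look like:
# selection_summary(selected(mdl), variable.names(mdl, which = ""),
#                   mstop(mdl), coef(mdl))
```

The resulting data frame has one row per selected variable and can be passed to any plotting routine, so the package itself would not need an extra dependency.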