boost-R / mboost

Boosting algorithms for fitting generalized linear, additive and interaction models to potentially high-dimensional data. The current relase version can be found on CRAN (http://cran.r-project.org/package=mboost).
73 stars 27 forks source link

function to summarize loss reductions achieved by each base-learner #27

Open fabian-s opened 8 years ago

fabian-s commented 8 years ago

Almond Stöcker und Tobi Kühn have started to write a function to extract the amount of risk reduction contributed by each base-learner from a fitted mboost model. We'd like to see this included in mboost, with visualisation options, if possible, as it seems to answer a question that comes up frequently in postprocessing and interpreting boosting fits: which base-learners are the most important for the fit? See code below.

# variable importance
RelRisk <- function( model, ohne1 = FALSE )
{
  baselearner <- names(model$baselearner)
  learner.type = sapply( strsplit(baselearner, "\\(") , "[[", 1)
  learner.type.new = as.vector(sapply(learner.type, FUN = function(x) 
    switch(x, bols = "(linear)",
              bbs  = "(nonlinear)",
              bspatial = "(spatial)")))
  var.names    = as.vector(variable.names(model))  

  baselearner.short = paste(var.names, learner.type.new) 
  n <- length(model$response)

  # Which Baselearners were selected while boosting:
  selected <- model$xselect()

  # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  # Emp. Risk:
  # Initial Risk for the Intercept Model:
  Risk0 <- with( model, family@risk( response, offset ) )
  # Risk after the Boosting-Steps:
  Risk <- model$risk()
  # Risk loss per step:
  RiskDif <- c(Risk0, Risk[-length(Risk)]) - Risk
  # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

  # Falls gewuenscht, ohne den ersten Schritt:
  if(ohne1) 
  {
    RiskDif <- RiskDif[-1]
    selected <- selected[-1]
  }

  # Explained Risk attributed to Baselearners
  explained <- rep( 0, length(baselearner) )
  for( i in 1:length(baselearner)) explained[i] <- sum( RiskDif[which(selected==i)] )

  # Selection percentage of the baselearners
  frequence <- rep(0, length(baselearner))
  for( i in 1:length(baselearner)) frequence[i] <- mean( selected == i )

  par(mar = c(5, 13, 4, 2) + 0.1 )
  (b <- barplot(height = explained / n , names.arg = baselearner.short, las = 1, horiz = TRUE, main = paste("Endrisiko =", Risk[length(Risk)] / n) ) )
  text(x = max(explained/n)/10, y = b, labels = frequence)
  par(mar = c(5, 4, 4, 2) + 0.1 )
}

> cars.gb <- gamboost(dist ~ bols(speed) + bbs(speed, center=TRUE), data = cars,
+   control = boost_control(mstop = 50))
> RelRisk(cars.gb)

image

Any input on what such a function should or should not do would be highly appreciated!

ja-thomas commented 8 years ago

Hi Fabian,

awesome!

Some notes/thoughts:

Here is how it looks for my data:

RelRisk(zero) example RelRisk(zero, ohne1= FALSE) example2

and to compare my ggplot solution: zero_importance.pdf