boost-R / gamboostLSS

Boosting models for fitting generalized additive models for location, shape and scale (GAMLSS) to potentially high dimensional data. The current relase version can be found on CRAN (https://cran.r-project.org/package=gamboostLSS).
26 stars 11 forks source link

Problem on calculating risk in gamboostLSS with noncyclical approach #56

Open FAUBZhang opened 5 years ago

FAUBZhang commented 5 years ago

Consider the following Gaussian distribution example:

library(gamboostLSS)

# g_muvsi is a function that allows to see which distribution
# parameter is updated in each iteration.
# If mu is updated, then returns 1, if sigma is updated, then it returns 2.
# The first value in the risk vector is removed as it is the initial value.
g_muvsi <- function(gamboostLSS_object) {
  df1 <- data.frame(risk = risk(gamboostLSS_object)$mu[-1], muvsi = 1)
  df2 <- data.frame(risk = risk(gamboostLSS_object)$sigma[-1], muvsi = 2)
  df <- rbind(df1, df2)
  df <- df[order(df$risk, decreasing = TRUE), ]
  return(df$muvsi)
}

set.seed(1)
n <- 3
x1 <- c(1,2,3)

mu <- x1
sigma <- exp(.1 * x1)

data <- cbind(x1)
y <- rnorm(3, mu, sigma)

# center x1
cx1 <- scale(x1, center = T, scale = F)

b_g <- gamboostLSS(y ~ bols(cx1),
                   control = boost_control(mstop = 10),
                   method = "noncyclic")

By applying the g_muvsi() function to the gamboostLSS object we can find that sigma is updated in the third boosting iteration.

> g_muvsi(b_g)
 [1] 1 2 2 1 2 2 1 2 2 1

However, if the negative gradient for mu and sigma after the second boosting iteration is caluclated manually

mstop(b_g) = 2
ngr_mu <- (1/exp(predict(b_g)$sigma)^2) * (y - predict(b_g)$mu)
ngr_si <- -1 + exp(-2 * predict(b_g)$sigma) * (y - predict(b_g)$mu)^2

fit them with linear regression

lm_mu <- lm(ngr_mu ~ cx1)
lm_si <- lm(ngr_si ~ cx1)

and finally calculate the empirical risk with step length 0.1

risk_mu <- -sum(dnorm(y, predict(b_g)$mu + .1 * lm_mu$fitted.values, exp(predict(b_g)$sigma), log = T))
risk_si <- -sum(dnorm(y, predict(b_g)$mu, exp(predict(b_g)$sigma + .1 * lm_si$fitted.values), log = T))

we'll get

> risk_mu
[1] 3.515798
> risk_si
[1] 3.517315

which means that mu should be updated in third iteration, but not sigma.

According to my debug analysis, I've found that the empirical risk for one distribution parameter in iteration [m] depends on which distribution parameter was updated in the previous iteration [m-1]. If, like this example, sigma is updated in the second iteration, then the risk for mu in the third iteration is calculated with the following formula

> but_risk_mu <- -sum(dnorm(y, predict(b_g)$mu, exp(predict(b_g)$sigma), log = T))
> but_risk_mu
[1] 3.609269

Comparing but_risk_mu and risk_si, the winner is sigma, just the same as the gamboostLSS output. Similarly, if mu is updated in iteration [m-1], then the risk for sigma in iteration [m] is not calculated like risk_si, but a value without step length updates

but_risk_si <- -sum(dnorm(y, predict(b_g)$mu, exp(predict(b_g)$sigma), log = T))

I think the problem appears somewhere between row 243 and 287 in https://github.com/boost-R/gamboostLSS/blob/f97c8906fc9d78e05cb28823789b12cd31c06fdc/R/mboostLSS.R#L243

FAUBZhang commented 5 years ago

@ja-thomas @ewaldmann