boost-R / gamboostLSS

Boosting models for fitting generalized additive models for location, shape and scale (GAMLSS) to potentially high dimensional data. The current relase version can be found on CRAN (https://cran.r-project.org/package=gamboostLSS).
26 stars 11 forks source link

Scoping problem with "combined_risk" in nc_mboostLSS #49

Closed Almond-S closed 6 years ago

Almond-S commented 6 years ago

attr(model1, "combined_risk") seems to always relate to the last fitted model instead of model1. And perhaps related to this issue, the combined_risk cannot be returned when fitting multiple models with mclapply:

library(gamboostLSS)
###~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~###
### Scoping problem with "combined_risk" in noncyclical mboostLSS 
### Particularly, prohibiting the access to "combined_risk" after fitting multiple models via mclapply 
###~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~###

### Use first example of ?mboostLSS but with mclapply over the methods cyclic and noncyclic:

# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
### Data generating process:
set.seed(1907)
x1 <- rnorm(1000)
x2 <- rnorm(1000)
x3 <- rnorm(1000)
x4 <- rnorm(1000)
x5 <- rnorm(1000)
x6 <- rnorm(1000)
mu    <- exp(1.5 +1 * x1 +0.5 * x2 -0.5 * x3 -1 * x4)
sigma <- exp(-0.4 * x3 -0.2 * x4 +0.2 * x5 +0.4 * x6)
y <- numeric(1000)
for( i in 1:1000)
  y[i] <- rnbinom(1, size = sigma[i], mu = mu[i])
dat <- data.frame(x1, x2, x3, x4, x5, x6, y)
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

### linear models with y ~ . for both components: 20 boosting iterations fitted via mclapply
model <- mclapply(c("cyclic", "noncyclic"), function(method) glmboostLSS(y ~ ., families = NBinomialLSS(), data = dat, 
                                                                       control = boost_control(mstop = 20),method = method, center = TRUE))
# only consider the noncyclic model[[2]] (as the cyclic does not have "combined_risk") 
model_nc <- model[[2]]

# Cannot access combined_risk
try(attr(model_nc, "combined_risk")())

### If there was another model previously fitted, the combined_risk of this other model is returned instead:
# e.g. model instead x1 as response and mstop = 30  
model_0 <- mboostLSS(x1 ~ ., families = GaussianLSS(), data = dat, 
                     control = boost_control(mstop = 30),method = "noncyclic")

try(length(attr(model_nc, "combined_risk")()))
# => combined_risk of model_0 (of length 30+2) is returned instead of the one from model_nc (length 20+2)

### And if we now fit model_nc seperately, the combined_risk of model_0 is replaced !!! 
model_1 <- glmboostLSS(y ~ ., families = NBinomialLSS(), data = dat, 
                       control = boost_control(mstop = 20),method = "noncyclic", center = TRUE)

length(attr(model_0, "combined_risk")())
# => now, length is 22 !
hofnerb commented 6 years ago

@ja-thomas can you please look at this?

ja-thomas commented 6 years ago

Problem is here:

https://github.com/boost-R/gamboostLSS/blob/a462d46caca91fe1869c8c6d82c379df469647f2/R/mboostLSS.R#L286

I'm not sure why I had to write this (horrible) line... I need to check

ja-thomas commented 6 years ago

Hmmmm, I played around with this some more. I kind of remember why I did it. Since assignment for the parent.frame was not working correctly and gamboostLSS is writing to the .GlobaleEnv anyways.

This is pretty bad design from my part, but I didn't find a solution at the time and unfortunately I'm still not sure how to find the correct environment to write the combined_risk to. This seems to have to do with the fact that we assign functions (with environment) to attributes of objects.

tldr.: This sucks, but I have no idea how to fix it :(

hofnerb commented 6 years ago

Thanks @ja-thomas for digging into this.

Without having had another look I think the issue is that some functions are defined outside of the relevant environment and hence do not share the same parent environments. Thus, the combined_risk is written to the wrong place.

Of note, I do not think that gamboostLSS writes to the global environment in other places. We define global variables but only ro get rid of NOTES in R CMD check. Actually, these variables should be defined at the time they are used inside another environment than the global environment.

@Almond-S I am not sure if we should really close this issue. It still isn't solved. Of course it has not the highest priority but it should stay on the list of open issues. Perhaps I'll have some spare time on a train ride can look into this (or someone else has a good idea how to fix this).

hofnerb commented 6 years ago

With @ja-thomas hints that was easy to fix. combined_risk simply needed to be defined within mboostLSS_fit. I'll add some checks to see if it really works and close the issue afterwards.

Almond-S commented 6 years ago

Thank you both so much!

I had all my simulation studies showing apparently corrupt crossvalidation results, which did not fit to the results with prior gamboostLSS versions, and think that this should have been the problem. And this was kind of the last issue prohibiting me from finishing a project on FDboostLSS for functional responses.

So many thanks again!!

ja-thomas commented 6 years ago

Thanks @hofnerb you're a genius :)

hofnerb commented 6 years ago

You are more than welcome. Thanks for locating the reason in the first place!

@Almond-S Is it currently sufficient for you to have the bug fixed on github or should we make a quick release to CRAN as well?

Almond-S commented 6 years ago

I think, having it fixed on gitHub should be completely fine. But if it won't, for whatever reason, I'd get back to you. Thanks again!