boost-R / mboost

Boosting algorithms for fitting generalized linear, additive and interaction models to potentially high-dimensional data. The current relase version can be found on CRAN (http://cran.r-project.org/package=mboost).
73 stars 27 forks source link

cvrisk for CoxPH models broken #85

Closed hofnerb closed 6 years ago

hofnerb commented 7 years ago

Using the cross-validation approach for Cox PH models as described in Verweij and van Houwelingen (1993) (cvrisk(..., correted = TRUE)) does not seem to work:

## use CoxFlexBoost to simulate survival times
install.packages("CoxFlexBoost", repos="http://R-Forge.R-project.org")
library("CoxFlexBoost")

## load further packages
library("devtools")
install_github("boost-R/mboost")  ## use most recent mboost to allow leave-one-out CV
library("mboost")
library("survival")

### sample data
set.seed(1234)
## sample covariates first
X <- matrix(NA, nrow=400, ncol=3)
X[,1] <- runif(400, -1, 1)
X[,2] <- runif(400, -1, 1)
X[,3] <- runif(400, -1, 1)

## time-constant hazard rate (needs to be a vectororized in the first argument time)
lambda <- function(time, x){
    exp(0 * time + 0.7 * x[1] + 0.2 * x[2] - 0.2 * x[3])
}

## specify censoring function
cens_fct <- function(time, mean_cens){
    ## censoring times are independent exponentially distributed
    censor_time <- rexp(n = length(time), rate = 1/mean_cens)
    event <- (time <= censor_time)
    t_obs <- apply(cbind(time, censor_time), 1, min)
    ## return matrix of observed survival times and event indicator
    return(cbind(t_obs, event))
}

## simulate data with above hazard function and censoring function
data <- rSurvTime(lambda, X, cens_fct, mean_cens = 5)
## add another 20 non-informative variables
data <- cbind(data, as.data.frame(matrix(rnorm(400*20), ncol = 20)))

## estimate standard CoxPH model
summary(coxph(Surv(time, event) ~ ., data))
## coefficients seem ok, simulation worked

### now boosting model
fm <- as.formula(paste("Surv(time, event) ~", paste(names(data)[-(1:2)], collapse = "+")))
mod <- glmboost(fm, data, family = CoxPH())
coef(mod)

## cvrisk not working (here leave-one-out CV)
cvr <- cvrisk(mod, folds = cv(model.weights(mod), type = "kfold", B = 400), grid = 1:200)
plot(cvr) 
mstop(cvr)
## does not seem sensible (as it stops immediately)

## using negative of cvr also not useful (as it does not seem to overfit at all)
plot(-cvr)
mstop(-cvr)

grafik

Using the "uncorrected" version seems fine

cvr2 <- cvrisk(mod, corrected = FALSE)
plot(cvr2)
mstop(cvr2)

grafik

hofnerb commented 7 years ago

[Update]

Leave-one-out crossvalidation also does not work with corrected = FALSE, i.e., esentially, the paper is "correct" as it only deals with leave-one-out crossvalidation.

cvr <- cvrisk(mod, folds = cv(model.weights(mod), type = "kfold", B = 400), 
                    grid = 1:200, corrected = FALSE)
plot(cvr)

grafik

hofnerb commented 7 years ago

Here the relevant code: https://github.com/boost-R/mboost/blob/b6c6827728a374c47d42697a621ab10fef93c06d/R/crossvalidation.R#L40-L68

The dummyfct is the (most) relevant function.

hofnerb commented 7 years ago

We are trying to estimate

grafik

where the grafik (Figures are screenshots from Verweij and van Houwelingen).

Hence, we refit the model without the out-of-bag (oobag) observations (line 54) and compute the inbag-risk (see line 66), i.e. the negative log-likelihood obtained without the oobag observations. This should equal to the negative of grafik evaluated at grafik. (For this reason I think we need to compute the "inbag" risk, as we want to compute the log-likelihood based on the subsample used to fit the model.) Consequently, we need -risk(mod) in order to obtain grafik evaluated at grafik.

What is missing is the complete likelihood, i.e., for all data, evaluated again at the estimate grafik. Hence, we need the negative loss function for the whole data evaluated at the relevant estimates (or predictions relating to the estimates, see line 58). Note that predict(mod, aggregate = "cumsum") also makes predictions for observations with weight 0, i.e., for the oobag observations. The loss is computed in line 63. Here we relate all observed outcomes (object$response) to all predictions based on the model without the oobag observations (pr --> f).

In summary, I would conclude that we need to compute the sum of these two quantities with correct signs. However, no combination of signs shows an appropriate behavior. Actually, I think that line 66 should be

lplk + mod$risk()[grid]

to obtain the cvl criterion, and as we need a loss function rather than a likelihood, we should use -cvl, i.e., line 66. However, +cvl lead to a decreasing function in contrast to Table 1 in Verweij and van Houwelingen, where cvl increases with additional variables until a certain point and then starts to decrease again (maximum is highlighted in yellow): grafik

hofnerb commented 6 years ago

@mschmid25 correctly stated that it should of course read

- lplk + mod$risk()[grid]

However, this fix also does not solve the issue. The functionality thus will eventually be removed.