cvrisk for CoxPH models broken

hofnerb commented 7 years ago

Using the cross-validation approach for Cox PH models as described in Verweij and van Houwelingen (1993) (cvrisk(..., correted = TRUE)) does not seem to work:

## use CoxFlexBoost to simulate survival times
install.packages("CoxFlexBoost", repos="http://R-Forge.R-project.org")
library("CoxFlexBoost")

## load further packages
library("devtools")
install_github("boost-R/mboost")  ## use most recent mboost to allow leave-one-out CV
library("mboost")
library("survival")

### sample data
set.seed(1234)
## sample covariates first
X <- matrix(NA, nrow=400, ncol=3)
X[,1] <- runif(400, -1, 1)
X[,2] <- runif(400, -1, 1)
X[,3] <- runif(400, -1, 1)

## time-constant hazard rate (needs to be a vectororized in the first argument time)
lambda <- function(time, x){
    exp(0 * time + 0.7 * x[1] + 0.2 * x[2] - 0.2 * x[3])
}

## specify censoring function
cens_fct <- function(time, mean_cens){
    ## censoring times are independent exponentially distributed
    censor_time <- rexp(n = length(time), rate = 1/mean_cens)
    event <- (time <= censor_time)
    t_obs <- apply(cbind(time, censor_time), 1, min)
    ## return matrix of observed survival times and event indicator
    return(cbind(t_obs, event))
}

## simulate data with above hazard function and censoring function
data <- rSurvTime(lambda, X, cens_fct, mean_cens = 5)
## add another 20 non-informative variables
data <- cbind(data, as.data.frame(matrix(rnorm(400*20), ncol = 20)))

## estimate standard CoxPH model
summary(coxph(Surv(time, event) ~ ., data))
## coefficients seem ok, simulation worked

### now boosting model
fm <- as.formula(paste("Surv(time, event) ~", paste(names(data)[-(1:2)], collapse = "+")))
mod <- glmboost(fm, data, family = CoxPH())
coef(mod)

## cvrisk not working (here leave-one-out CV)
cvr <- cvrisk(mod, folds = cv(model.weights(mod), type = "kfold", B = 400), grid = 1:200)
plot(cvr) 
mstop(cvr)
## does not seem sensible (as it stops immediately)

## using negative of cvr also not useful (as it does not seem to overfit at all)
plot(-cvr)
mstop(-cvr)

grafik

Using the "uncorrected" version seems fine

cvr2 <- cvrisk(mod, corrected = FALSE)
plot(cvr2)
mstop(cvr2)

grafik

hofnerb commented 7 years ago

[Update]

Leave-one-out crossvalidation also does not work with corrected = FALSE, i.e., esentially, the paper is "correct" as it only deals with leave-one-out crossvalidation.

cvr <- cvrisk(mod, folds = cv(model.weights(mod), type = "kfold", B = 400), 
                    grid = 1:200, corrected = FALSE)
plot(cvr)

grafik

hofnerb commented 7 years ago

Here the relevant code: https://github.com/boost-R/mboost/blob/b6c6827728a374c47d42697a621ab10fef93c06d/R/crossvalidation.R#L40-L68

The dummyfct is the (most) relevant function.

hofnerb commented 7 years ago

We are trying to estimate

grafik

where the grafik (Figures are screenshots from Verweij and van Houwelingen).

Hence, we refit the model without the out-of-bag (oobag) observations (line 54) and compute the inbag-risk (see line 66), i.e. the negative log-likelihood obtained without the oobag observations. This should equal to the negative of grafik evaluated at grafik . (For this reason I think we need to compute the "inbag" risk, as we want to compute the log-likelihood based on the subsample used to fit the model.) Consequently, we need -risk(mod) in order to obtain grafik evaluated at grafik .

What is missing is the complete likelihood, i.e., for all data, evaluated again at the estimate grafik . Hence, we need the negative loss function for the whole data evaluated at the relevant estimates (or predictions relating to the estimates, see line 58). Note that predict(mod, aggregate = "cumsum") also makes predictions for observations with weight 0, i.e., for the oobag observations. The loss is computed in line 63. Here we relate all observed outcomes (object$response) to all predictions based on the model without the oobag observations (pr --> f).

In summary, I would conclude that we need to compute the sum of these two quantities with correct signs. However, no combination of signs shows an appropriate behavior. Actually, I think that line 66 should be

lplk + mod$risk()[grid]

to obtain the cvl criterion, and as we need a loss function rather than a likelihood, we should use -cvl, i.e., line 66. However, +cvl lead to a decreasing function in contrast to Table 1 in Verweij and van Houwelingen, where cvl increases with additional variables until a certain point and then starts to decrease again (maximum is highlighted in yellow): grafik

hofnerb commented 6 years ago

@mschmid25 correctly stated that it should of course read

- lplk + mod$risk()[grid]

However, this fix also does not solve the issue. The functionality thus will eventually be removed.

boost-R / mboost

cvrisk for CoxPH models broken #85