Closed hofnerb closed 6 years ago
[Update]
Leave-one-out cross-validation also does not work with `corrected = FALSE`, i.e., essentially, the paper is "correct" as it only deals with leave-one-out cross-validation.
```r
cvr <- cvrisk(mod, folds = cv(model.weights(mod), type = "kfold", B = 400),
              grid = 1:200, corrected = FALSE)
plot(cvr)
```
Here is the relevant code: https://github.com/boost-R/mboost/blob/b6c6827728a374c47d42697a621ab10fef93c06d/R/crossvalidation.R#L40-L68
`dummyfct` is the (most) relevant function.
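We are trying to estimate the cross-validated log-likelihood

$$\mathrm{cvl} = \sum_{i} \left[ \ell\bigl(\hat\beta_{(-i)}\bigr) - \ell_{(-i)}\bigl(\hat\beta_{(-i)}\bigr) \right],$$

where $\ell$ is the log-likelihood of all data, $\ell_{(-i)}$ is the log-likelihood without the left-out observation(s) $i$, and $\hat\beta_{(-i)}$ is the estimate obtained without them. (The formulas in the original post were screenshots from Verweij and van Houwelingen.)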
Hence, we refit the model without the out-of-bag (oobag) observations (line 54) and compute the inbag risk (see line 66), i.e., the negative log-likelihood obtained without the oobag observations. This should equal the negative of $\ell_{(-i)}$, the log-likelihood without the oobag observations, evaluated at $\hat\beta_{(-i)}$, the estimate obtained without them. (For this reason I think we need to compute the "inbag" risk, as we want to compute the log-likelihood based on the subsample used to fit the model.) Consequently, we need `-risk(mod)` in order to obtain $\ell_{(-i)}$ evaluated at $\hat\beta_{(-i)}$.
What is missing is the complete likelihood, i.e., for all data, evaluated again at the estimate $\hat\beta_{(-i)}$ obtained without the oobag observations. Hence, we need the negative loss function for the whole data evaluated at the relevant estimates (or predictions relating to the estimates, see line 58). Note that `predict(mod, aggregate = "cumsum")` also makes predictions for observations with weight 0, i.e., for the oobag observations. The loss is computed in line 63. Here we relate all observed outcomes (`object$response`) to all predictions based on the model without the oobag observations (`pr` --> `f`).
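The two quantities described above (the log-likelihood of all data and the log-likelihood without the left-out observation, both evaluated at the estimate fitted without that observation) can be illustrated outside of mboost. The following is a minimal base-R sketch, not the code in `crossvalidation.R`: it assumes a plain Cox model fitted with `optim()`, no tied event times, and simulated toy data. The names `cox_pll`, `fit_beta` and `cvl_terms` are hypothetical helpers introduced only for this illustration.

```r
# Cox log partial likelihood at a fixed beta (assumes no tied event times).
cox_pll <- function(beta, time, status, x) {
  eta <- drop(x %*% beta)
  ord <- order(time, decreasing = TRUE)   # cumsum() then yields risk-set sums
  eta <- eta[ord]
  sum(status[ord] * (eta - log(cumsum(exp(eta)))))
}

# Simulated toy data (hypothetical; not the data behind this issue).
set.seed(1)
n <- 40
x <- matrix(rnorm(n * 2), n, 2)
time <- rexp(n, rate = exp(drop(x %*% c(0.8, 0))))
status <- rbinom(n, 1, 0.8)

# Maximise the partial likelihood on the rows selected by `keep`.
fit_beta <- function(keep) {
  neg_pll <- function(b) {
    -cox_pll(b, time[keep], status[keep], x[keep, , drop = FALSE])
  }
  optim(c(0, 0), neg_pll, method = "BFGS")$par
}

# Leave-one-out contributions: l(beta_(-i)) - l_(-i)(beta_(-i)).
cvl_terms <- vapply(seq_len(n), function(i) {
  b <- fit_beta(-i)                                   # estimate without i
  full  <- cox_pll(b, time, status, x)                # all observations
  inbag <- cox_pll(b, time[-i], status[-i], x[-i, , drop = FALSE])
  full - inbag
}, numeric(1))

cvl <- sum(cvl_terms)
```

In this sketch `-inbag` plays the role of the inbag risk and `-full` the role of the loss over all observations, so each contribution is "minus loss on all data plus inbag risk".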
In summary, I would conclude that we need to compute the sum of these two quantities with correct signs. However, no combination of signs shows an appropriate behavior. Actually, I think that line 66 should be `lplk + mod$risk()[grid]` to obtain the cvl criterion, and as we need a loss function rather than a likelihood, we should use -cvl, i.e., line 66. However, +cvl leads to a decreasing function, in contrast to Table 1 in Verweij and van Houwelingen, where cvl increases with additional variables until a certain point and then starts to decrease again (the maximum is highlighted in yellow in the screenshot of that table).
@mschmid25 correctly stated that it should of course read `- lplk + mod$risk()[grid]`.
However, this fix also does not solve the issue. The functionality thus will eventually be removed.
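For reference, the sign bookkeeping behind `- lplk + mod$risk()[grid]` can be written out as follows. This is a sketch under two assumptions that are not confirmed by the text above: that `lplk` holds the loss (negative log-likelihood) over all observations, and that `mod$risk()` holds the inbag risk of the model fitted without the oobag observations. The symbols $\ell$, $\ell_{(-i)}$ and $\hat\beta_{(-i)}$ denote the log-likelihood of all data, the log-likelihood without the left-out observations, and the estimate obtained without them.

```latex
\begin{align*}
\texttt{mod\$risk()} &= -\ell_{(-i)}\bigl(\hat\beta_{(-i)}\bigr)
  && \text{(inbag risk)} \\
\texttt{lplk} &= -\ell\bigl(\hat\beta_{(-i)}\bigr)
  && \text{(loss over all observations, assumed)} \\
\texttt{-lplk + mod\$risk()} &=
  \ell\bigl(\hat\beta_{(-i)}\bigr) - \ell_{(-i)}\bigl(\hat\beta_{(-i)}\bigr)
  && \text{(contribution of fold } i \text{ to cvl)}
\end{align*}
```

This only spells out the intended signs; it does not explain why the resulting curve still fails to behave as expected.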
Using the cross-validation approach for Cox PH models as described in Verweij and van Houwelingen (1993) (`cvrisk(..., corrected = TRUE)`) does not seem to work. Using the "uncorrected" version seems fine.