chjackson / flexsurv

The flexsurv R package for flexible parametric survival and multi-state modelling
http://chjackson.github.io/flexsurv/
55 stars 27 forks source link

More methods needed for residuals.flexsurvreg #78

Open chjackson opened 4 years ago

chjackson commented 4 years ago

The current implementation only supports difference between observed survival and predicted mean, which ignores censoring, so is of limited use. The survival package seems to do something more considered for residuals.survreg.

It'd be simple to add deviance-type residuals, as flexsurvreg returns a component $logliki giving each individual's contribution to the maximised log-likelihood.

I sometimes get asked about Cox-Snell residuals, but I'm not familiar with those.

@mattwarkentin are you familiar with any of these methods?

mattwarkentin commented 4 years ago

I definitely agree that residuals.flexsurvreg needs much better support for common survival-type residuals. I am modestly familiar with the types supported by residuals.coxph and residuals.survreg. It would be nice to build toward that kind of support.

As you mention, I think deviance-type residuals shouldn't be too tricky to implement. I also think dfbeta and dfbetas are simple enough, if not potentially computationally taxing. Otherwise, I am interested in digging more into Martingale, Schoenfeld, and Cox-Snell residuals to see what can be supported.

mattwarkentin commented 4 years ago

I was flipping through these slides, and in particular slide 4, which defines the Cox-Snell residual as: Screen Shot 2020-06-10 at 5 31 57 PM

Am I mistaken, or is this the value returned by predict(x, type = "cumhaz") when .time is equal to a participants time t? In other words, the value of the model-predicted cumulative hazard function for participant i at time ti.

If this is true, the Martingale residual follows from the above equation as (slide 13): Screen Shot 2020-06-10 at 5 38 31 PM

Where di is the value of the failure code (0 or 1) and ei is the Cox-Snell residual.

And lastly, the deviance residual follows as (slides 18-19): Screen Shot 2020-06-10 at 5 39 41 PM Where mi is the martingale residual, li is the individual contribution to the log-likelihood, and ~li is the maximum possible log-likelihood.

Does this make some sense? I turn to your expertise.

chjackson commented 4 years ago

That's a nice clear set of slides, thanks for finding. That all makes sense. That definition of the Cox-Snell residual assumes a proportional hazards model, but like the slides say, that can be generalised to the fitted individual-specific cumulative hazards for any model, which are compared to Exp(1) to assess the model.

mattwarkentin commented 4 years ago

No problem! This has been a good learning experience for me.

I've enjoyed the process of contributing to this package, so I am interested in helping out with this issue too, if you are interested in my help. However, I want to be mindful of not overstaying my welcome, so to speak.

chjackson commented 4 years ago

Go for it - I'm happy to have you contribute.

heor-robyoung commented 1 year ago

Looks to me like the CS residuals are off for censored observations. Have a look at these modifications of the "true model" scenario:

set.seed(42)
y <- rweibull(10000, 2, 2)
fite <- flexsurvreg(Surv(y) ~ 1, dist="weibull")
cs <- coxsnell_flexsurvreg(fite)
plot(cs$"(qexp)", cs$est, pch=19, xlab="Theoretical quantiles", ylab="Cumulative hazard")
abline(a=0,b=1,col="red",lwd=2)

# Simple cut-off
max_time <- 2 * median(y)
cens_time <- max_time

cens <- y > cens_time
y2 <- ifelse(cens, cens_time, y)
fitf <- flexsurvreg(Surv(y2, 1*(!cens)) ~ 1, dist = "weibull")
cs <- coxsnell_flexsurvreg(fitf)
plot(cs$"(qexp)", cs$est, pch=19, xlab="Theoretical quantiles", ylab="Cumulative hazard")
abline(a=0,b=1,col="red",lwd=2)

# Random censoring
censor_rate <- 1 / median(y)
cens_time <- rexp(length(y), censor_rate)
cens <- y > cens_time
y2 <- ifelse(cens, cens_time, y)
fitf <- flexsurvreg(Surv(y2, 1*(!cens)) ~ 1, dist = "weibull")
cs <- coxsnell_flexsurvreg(fitf)
plot(cs$"(qexp)", cs$est, pch=19, xlab="Theoretical quantiles", ylab="Cumulative hazard")
abline(a=0,b=1,col="red",lwd=2)

The suggestion here: https://uu.diva-portal.org/smash/get/diva2:826234/FULLTEXT01.pdf - equation 2.1.4 - is to add the expected value (1) to censored observations. I'm not sure it's a happy solution for the heavily censored case, but it does at least shift the line back onto mean = 1 in this "perfect" case.

chjackson commented 1 year ago

I'm having trouble following the theory behind this modification. Why would we favour adding any specific number to the residual for censored observations? What is meant by making them "compatible"? In general I wouldn't expect censored observations to tell us much useful about the goodness of fit of the model, e.g as the censoring time gets earlier they don't tell us anything.

heor-robyoung commented 1 year ago

Censored information can be highly informative about fit, particularly if you have decreasing hazard and a lot of administrative censoring, such that your Kaplan - Meier is going through an extended plateau up to maximum follow-up. If the model predicts high density in this plateau period when you have evidence to the contrary in the form of censored observations at greater times, that's poor fit.

Removing the censored observations outright will cause issues of its own, as the remaining obs won't be unit exponential (unless censoring is somehow completely independent of event time, which is only going to happen if someone goes back into a dataset and randomly picks observations to censor - otherwise it is inevitable that longer survival times are more likely to be censored). So they need to be incorporated somehow, and the current implementation gives a clear bias to the residual plot.

Best,

Rob