chjackson / flexsurv

The flexsurv R package for flexible parametric survival and multi-state modelling
http://chjackson.github.io/flexsurv/
53 stars 28 forks source link

Interval Censored Data #173

Closed zou-ims closed 7 months ago

zou-ims commented 7 months ago

Please see my example below. I used the development version. model1 has right censored data and ran successfully. model2 and model3 have interval censored; the parameter estimates are similar to model1, as expected, but had convergence issue. model4 failed when bhazard was added.

In function logLikFactory, event <- Y[,"status"] == 1

With the interval censored data like I had in my example, "status" has values 3, 0, 3, 0, ... Should that line be changed to include status=3 as well?

Thanks.

library(flexsurv) data<-data.frame( time=c(1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8,9,9,10,10,11,11,12,12,13,13,14,14,15,15), death=c(1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1,0), death2=c(3,0,3,0,3,0,3,0,3,0,3,0,3,0,3,0,3,0,3,0,3,0,3,0,3,0,3,0,3,0), weight=c(18,3,26,2,29,33,29,33,30,29,29,27,28,25,26,23,24,20,21,17,18,16,16,13,13,11,10,9,8,7), bhazard=c(0.01,0.01,0.02,0.02,0.03,0.03,0.04,0.04,0.05,0.05,0.06,0.06,0.07,0.07,0.08,0.08, 0.09,0.09,0.1,0.1,0.11,0.11,0.12,0.12,0.13,0.13,0.14,0.14,0.15,0.15)) data$time2<-ifelse(data$death==1,data$time+1,Inf) data$time3<-data$time+0.5 model1<-flexsurvreg(Surv(time3,death)~1,data=data,weights=weight,dist="weibull") model2<-flexsurvreg(Surv(time=time,event=death2,time2=time2,type='interval')~1,data=data,weights=weight,dist="weibull") model3<-flexsurvreg(Surv(time,time2,type='interval2')~1,data=data,weights=weight,dist="weibull") model4<-flexsurvreg(Surv(time,time2,type='interval2')~1,data=data,weights=weight,bhazard=bhazard,dist="weibull")

model1<-flexsurvreg(Surv(time3,death)~1,data=data,weights=weight,dist="weibull") model2<-flexsurvreg(Surv(time=time,event=death2,time2=time2,type='interval')~1,data=data,weights=weight,dist="weibull") Warning message: In flexsurvreg(Surv(time = time, event = death2, time2 = time2, : Optimisation has probably not converged to the maximum likelihood - Hessian is not positive definite. model3<-flexsurvreg(Surv(time,time2,type='interval2')~1,data=data,weights=weight,dist="weibull") Warning message: In flexsurvreg(Surv(time, time2, type = "interval2") ~ 1, data = data, : Optimisation has probably not converged to the maximum likelihood - Hessian is not positive definite. model4<-flexsurvreg(Surv(time,time2,type='interval2')~1,data=data,weights=weight,bhazard=bhazard,dist="weibull") Error in dloglik[dead, ] - dscense : non-conformable arrays

chjackson commented 7 months ago

Thanks for this report. I've fixed the bug I think in https://github.com/chjackson/flexsurv/commit/f98312c21e637fe3fb96fb180e88be3e1904945e. The bug was in the code to calculate derivatives and second derivatives - it was failing when bhazard was specified and there were no uncensored event times.

loglikFactory was correct, I think, because event indicates the people whose event time is known exactly, hence the likelihood comes from the PDF at that time. For interval censored event times, the likelihood is calculated from the difference between the CDFs at time2 and time1. The comment at https://github.com/chjackson/flexsurv/blob/master/R/flexsurvreg.R#L54 was misleading though, and I've changed it in this fix.

zou-ims commented 7 months ago

Thanks. Has it been rebuilt? I got errors when I ran install.packages("devtools") # if devtools not already installed devtools::install_github('chjackson/flexsurv-dev')

As for bhazard, my understanding is that it means, conditional on being alive, probability of dying in one time unit for general population. In my example data, status 3 has time2-time=1; but status 0 has time2=Inf. I think only records with status 3 should be used in the calculation. Could you point me to the program where likelihood was calculated for the interval censored data? Thanks.

chjackson commented 7 months ago

It gets built on your machine if you install it this way. What errors did you get?

bhazard is not a probability but a rate - see the user guide page 6. Though these will be approximately the same if the rate is small.

The user guide page 2-3 is also where the likelihood is explained. In the source code, the function check.flexsurv.response processes the Surv() object, and loglikFactory calculates the likelihood.

In bhazard models it's not documented what is done when there is interval censoring. The code ignores bhazard for all kinds of censored observations (left, right or interval). I think that's what should happen, as the background hazard only comes into the likelihood as an offset to the hazard for the exactly-observed event times (generalising Nelson et al)

zou-ims commented 7 months ago

Now it worked. I had to remove a few packages manually.

Is there any thing that be done to improve modle2 and model3 above?

As for bhazard model and interval censored data, I think it would be an improvement if the program can treat (t1,t2) as event when t1 is not -Inf and T2 is not Inf. bhazard input would be, conditional on being alive at t1, probability of dying in (t1,t2) for general population/(t2-t1).

chjackson commented 7 months ago

There is a very small amount of data in these examples, particularly if there is heavy censoring - I suspect the likelihood function is awkward and there's not enough infomation to confidently identify the maximum. There's only 2 parameters, so you could try to plot the 2D likelihood surface and see what is going on.

zou-ims commented 7 months ago

Model1 is about the same model and fit the data well. I suspect there is something going on with the interval censored data. Thanks.

zou-ims commented 7 months ago

I think this is how interval censored data can be handled when bhazard is presented.

Let f1 and S1 be density and survival of general population. Let f2 and S2 be density and survival of the model.

  1. formulate data as Surv type='interval2'; so every record is an event between t1 and t2. Let bhazard be the probability of dying in the interval given being alive at the beginning of the interval.

  2. if t2=Inf: these records can be ignored. Conditional on being alive at t1, probability of dying in (t1,Inf) is 1 for both general population and the model.

  3. if t1>-Inf and t2<Inf: bhazard=(S1(t1)-S1(t2))/S1(t1). Use (S2(t1)-S2(t2))/S2(t1)/(1-bhazard) in the likelihood function.

  4. if t1=t2: assign a delta that is small relative to t1 and let the interval be (t1-delta, t1+delta). This becomes the same setup of 3 above.

  5. if t1=-Inf, bhazard=integral(from 0 to t2)(f1(t)(S1(t)-S1(t2))/S1(t))dt / (1-S1(t2)). Use integral(from 0 to t2)(f2(t)(S2(t)-S2(t2))/S2(t))dt / (1-S2(t2)) / (1-bhazard) in the likelihood function.

chjackson commented 7 months ago

I had just been working on this yesterday! I agree, the logic in the package for background hazard models wasn't valid with interval or left censoring. The background survival probability didn't cancel nicely like it did when there was only right censoring.

I've fixed this I think in the latest commit. For times are left or interval censored, the corresponding bhazard should contain the probability of surviving the interval, conditionally on being alive at the start of the interval. So your model4 should give estimates in line with the other models if called as

model4 <- flexsurvreg(Surv(time,time2,type='interval2')~1,data=data,
                      weights=weight,bhazard=exp(-bhazard*(time2-time)),dist="weibull")

Suppose the survival is $S(t) = S_0(t) R(t)$ where $S_0(t)$ is the background survival function and $R(t)$ is the relative survival. The likelihood for an individual with an event time that is interval censored on $(t^{min},t^{max})$ is calculated as

$$\log(S(t^{min}) - S(t^{max})) = \log(S_0(t^{min})R(t^{min}) - S_0(t^{max})R(t^{max}))$$

$$ = \log(S_0(t^{min})) + \log(R(t^{min}) - S_0(t^{max})/S_0(t^{min})R(t^{max})) $$

$\log(S_0(t^{min}))$ can be ignored when maximising the likelihood, as it doesn't depend on the parameters. The conditional survival over the interval, $S_0(t^{max})/S_0(t^{min})$ is what gets supplied as bhazard in such cases. With right-censoring, this is 1. This is coded up in loglikFactory though in terms of $p() = 1-S()$ - see how the likelihood term gets modified by the conditional survival, labelled b_condsurv.

The only thing missing is explicit handling of $t^{min}=t^{max}$. I'll add this. I think bhazard should contain the actual hazard in these cases, then there is no need to define a $\delta$.

chjackson commented 7 months ago

For zero width intervals, Surv() automatically assigns these cases a status of 1. flexsurvreg will expect bhazard to contain the background hazard in these cases.

zou-ims commented 7 months ago

I am not sure if both probability and density can be in the same likelihood function. That's why I introduced delta.

For the left censored data, it's more complicated. The equation above will treat (-Inf, t2) as (0,t2).

chjackson commented 7 months ago

I don't understand what you mean on either of those points. The code in logLikFactory separates out the individuals with observed and censored event times, so it is fine for bhazard to contain a mixture of hazards and probabilities. What equation are you referring to? In flexsurv models, the survival time is never allowed to be less than zero, so the CDF p(-Inf) = p(0) = 0 for any model.

zou-ims commented 7 months ago

My second point: will a record of (-Inf, t2) be treated the same as (0, t2)? I think they were treated the same in the code but they meant different things.

For the first one, I just haven't seen a likelihood function that have both probability and density in it. It's either all probabilities (discrete) or all densities (continuous). The numbers are in different scale. I am not sure if they can be mixed together.

chjackson commented 7 months ago

The standard likelihood for a survival model mixes densities (for observed event times) and probabilities (for censored times).

On your second point - if these indicate the intervals for interval-censored records in a survival model, then they will always have the same probability, since the probability of being in (-Inf, 0) is always zero.

zou-ims commented 7 months ago

Agree with the first point, "The standard likelihood for a survival model mixes densities (for observed event times) and probabilities (for censored times)."

For the left censored data, how bhazard should be specified? The maximum interval is t2, but the actual is less.

chjackson commented 7 months ago

bhazard would be specified in the same way as with interval censored data: the probability of surviving the interval for someone alive at the start, $S0(t{max})/S0(t{min})$. With left censoring, the interval is ($t{min}=0$, $t{max}$), so the probability of being alive at the start is 1, so bhazard would be the unconditional probability of surviving the interval.

zou-ims commented 7 months ago

Then left censored would be treated the same as censored in (0, t_max). I think that would introduce bias.

Thanks Chris for all of your work. Luckily my current data only ha interval and right censored data. I will test current version and provide feedback.

chjackson commented 7 months ago

Could you please explain why you think there would be bias, given that survival times cannot be negative?

I am grateful to you for opening this issue and recognising the original problem. I would like you to be happy with the solution before I close the issue. If you think there is still a problem, could you please read through my previous responses, and present a clearly-argued case?

zou-ims commented 7 months ago

Does this help? It's easier for me to write SAS code. a is interval censored between 0 and 1. Assuming uniform(0 to 1). b is right censored (-Inf, 1). Unobserved t1 is set to also follow uniform(0 to 1).

Noter that and b have different means.

data a; do i=1 to 1000; a=ranuni(1); cutoff=ranuni(1); if a>cutoff then b=a; else b=.; output; end; run;

proc means; var a b; run;

Variable N Mean a 1000 0.5069828 b 512 0.6807934

chjackson commented 7 months ago

You've shown that truncating a random variable changes its mean. OK. I agree. But this does not have a clear connection to the original issue, or any of the previous comments in this thread. Also it is not completely correct - b is not "right censored on (-Inf,1)" because it is impossible for b to take values that are < 0.

I am now closing this issue, because as far as I can tell, the problem has been fixed, and this conversation is no longer constructive. If you can see an example where flexsurv does not do what you expect, please could you open a new issue that includes

(a) a minimal, reproducible example with code and data (b) a clear explanation of what output you should expect and why.

Or if the documentation does not explain clearly what the software does, then please point out where.

zou-ims commented 7 months ago

Re-opening this again instead of opening a new issue. Chris, as you suggested, model4 <- flexsurvreg(Surv(time,time2,type='interval2')~1,data=data, weights=weight,bhazard=exp(-bhazard(time2-time)),dist="weibull") This seems to be working numerically. However, input of bhazard (=exp(-bhazard(time2-time))) is not prob of dying in the interval. It's actually surviving the interval. The user should provide 1-exp(-bhazard*(time2-time)).

Since it's numerically correct, the program seems to be taking the input of bhazard as prob of surviving the interval. Since it's called bhazard, it should be 1- prob of surviving the interval.

chjackson commented 7 months ago

Well, it's a probability, not a hazard, and it was behaving as documented in help(flexsurvreg). Though I can see the argument that a probability of death is closer to the concept of a hazard than a probability of survival is, so I'm happy to change this (https://github.com/chjackson/flexsurv/commit/99a3b85b6e74715cce5f24be1e53d8b33a6217a8).

zou-ims commented 7 months ago

Thanks. It' working as expected now.