chjackson / flexsurv

The flexsurv R package for flexible parametric survival and multi-state modelling
http://chjackson.github.io/flexsurv/
53 stars 28 forks source link

Repeated measures for time-dependent covariates #176

Open mengyi-git opened 6 months ago

mengyi-git commented 6 months ago

Hi @chjackson,

I'm currently working with the flexsurvreg() function from your package, which offers great features for dealing with time-dependent covariates. However, I've encountered an issue related to repeated observations from the same individual.

In the aftreg() function from the eha package, there's an 'id' argument that tracks repeated observations from the same individual. It seems that the flexsurvreg() function does not consider scenarios where a single individual contributes multiple observations. When I run the code with both flexsurvreg() and aftreg() (omitting the 'id' argument in the latter), the results are identical. This leads me to believe that flexsurvreg() might not be accounting for repeated measures.

Is there an existing feature in flexsurvreg() that I might have missed, which allows for handling repeated observations from the same individual? If not, would it be possible to consider adding such functionality?

Below are the code snippets I used for comparison:

library(flexsurv)
library(eha)

# Code snippet for flexsurvreg
flexsurvreg(Surv(enter - 60, exit - 60, event) ~ sex + civ + region, 
            data = oldmort, dist = "weibull")

# Code snippet for aftreg without 'id' argument
aftreg(Surv(enter - 60, exit - 60, event) ~ sex + civ + region,
       data = oldmort, dist = "weibull")

# Code snippet for aftreg with 'id' argument
aftreg(Surv(enter - 60, exit - 60, event) ~ sex + civ + region,
       data = oldmort, id = id, dist = "weibull")

I appreciate any guidance or updates you can provide on this matter.

chjackson commented 6 months ago

You can include repeated observations from the same individual in flexsurvreg, just as you did by removing id when calling aftreg. This is still a valid model, but the limitation is that it doesn't account for potential correlation between observations from the same individual. I don't currently have any plans to add this functionality, but I will leave this issue open for discussion. It's not immediately clear to me from looking at aftreg's documentation what adding the id argument does, how/if this accounts for the correlation.

mengyi-git commented 6 months ago

Hi @chjackson,

Thank you for your response. I'd like to delve a bit deeper into this, setting aside the issue of correlation between observations from the same individual for the moment.

My primary focus now is on understanding how the nature of observations (whether they're from the same individual or different individuals) influences the log-likelihood function, especially in the context of time-dependent covariates.

Referencing the log-likelihood function given by Equation (4) in Zeng and Lin (2007), this equation includes a cumulative hazard function as part of the summand. My understanding is that if there are multiple observations from the same individual, these would be incorporated within the integral of the cumulative hazard function. Conversely, if the observations are from different individuals, this would result in the addition of separate cumulative hazard functions.

However, I acknowledge that my understanding might be incorrect or incomplete. Therefore, I'm particularly interested in how flexsurvreg() manages these scenarios. Does the function differentiate between observations from the same individual and different individuals in its calculation of the log-likelihood, especially with time-dependent covariates?

Thank you once again for your time and insights.

chjackson commented 6 months ago

Thanks for the clarification. flexsurvreg() currently has no notion of subject IDs. If the data are supplied in counting process format (e.g. Surv(start,stop,status)) then it will use the corresponding covariate value in this interval to calculate the probability of survival up to start, the term labelled $S_i(s_i)$ in the likelihood explained on page 3 of the user guide. So you are right, that will not be correct in general if there are time dependent-covariates that change between 0 and start.

I think I will work on doing this properly for the next version, so that this probability is formed from the cumulative hazard by accounting for covariate changes within a subject.

chjackson commented 6 months ago

I'm having second thoughts about implementing this. The likelihood for time-dependent covariates would be easy enough to implement. But the question remains what to do about prediction.

Simply estimating coefficients of time-dependent covariates may be of use to some. But I expect most people would expect to be able to predict from the model (e.g. survival probabilities, mean survival times). Users would need to specify in advance how covariates will change for the predicted individual, e.g. constant over a series of intervals. That'd be a substantial development, as most of the output functions would need to be modified.

I think I'd like to get a better idea what the use case for these kinds of models would be. I can see the use for demographic / chronic disease applications where we want risk to vary with both age and calendar time. Modelling covariates that change randomly (as opposed to predictably) I'd expect would be better done with joint longitudinal / time-to-event models.

I'll leave this issue open for discussion.

GDPOG commented 3 months ago

Hello Christopher,

I just come across this thread by chance, which draws my interest extremely. I had similar question regarding how does flexsurv deal with the time-varying covariate in fitting (or calculating the likelihood value effectively) in the past. I noticed that the convention Surv(start, stop, status) used in the survival package is kept here. Hence, I assumed that it follows the same concept in terms of calculating the likelihood and assumed the handling of time-varying covariate in flexsurv is correct for the parametric hazard model.

To be more specific, the cumulative hazard used in likelihood function (indirectly) will be the sum of a series of integral of hazard function over consecutive time intervals , each of which is depending on the covariate values within corresponding time interval. Clearly, the cumulative hazard should be reset to zero and the aforementioned integration should be restart when the calculation moves to a new subject.

In a COX-PH model, integration of hazard function is not needed for the partial likelihood function, as the integral of baseline hazard function is canceled out. However, the contribution of repeatedly measured (longitudinal) covariate from one subject to the denominator of partial likelihood function would not be the same for different ordered true event time (or uncensored event time) . Please excuse for my very abstract description without using equations. A detailed example is provided in the book Statistical Methods for Survival Data Analysis 3rd edition authored by ELISA T. LEE and JOHN WENYUWANG (page 344).

After reading your conversations, I fear my assumption about the calculation of likelihood by flexsurv was wrong. In any case, it would be good to understand how flexsurv calculates the likelihood function for each record in a dataset prepared like the following

id start stop status cov_t 1 0 1 0 2 1 1 3 0 4 1 3 5 1 4

In one of my analysis with flexsurv, I modelled a log-normal dist with a time-varying covariate on the parameter meanlog(mu). Surprisingly, the fitting was much "smoother" than I expected in comparison to the fitting with another software tool, which uses the numerical solver for ODEs to get the likelihood function. To understand the reason, I did some investigation of the flexsurv package and noticed it uses the PHI function in the math library of stats package in R to calculate the F(t), which is based on a "Rational Chebyshev approximation approach."

So I thought this is the reason that flexsurv did not encounter numerical difficulties during the optimization caused by sharp change in values of time-varying covariate, which can lead to stiff ODEs for numerical solver. But now, I wonder if the PHI function was used for each time interval of a subject with different meanlog values by the flexsurv. Maybe, you can provide me some more insights about it.

Regarding your question about the use case of parametric hazard model with time-varying covariate, I'd like to share some of my unmatured thoughts as well.

Firstly, there is well-known immortal time bias problem in survival analysis, predominantly in the COX-PH context. However, this bias can also appear in the parametric hazard model if the time-varying covariate is improperly added as time-fixed (or landmark) covariate to baseline hazard function.

Secondly, it is not always needed (nor possible) to describe the covariate with a function of time. So the integrated longitudinal + time-to-event model may very likely stay in concept.

Yes, the purpose of building a model (particularly with parametric hazard) is to predict. Without a longitudinal model, I would still see the value of predicting survival or the "expected survival" given a couple of repeated measurements of time-varying covariate instead of just one "baseline value" serving as prognostics. Here, I would refer to the "Peripheral arterial disease" case in Analysis of time-to-event for observational studies: Guidance to the use of intensity models. Stat Med. 2021;40(1):185-211, despite of the use of Cox-PH model in their paper.

Admittedly, it would be much more efforts if you would like to expand the flexsurv package to handle some more complicated models, i.e. ODE + time-to-event. But, I believe the capability of handling time-varying covariate properly for all distributions is a very valuable feature for modeler working with survival data.

Chenguang Wang

chjackson commented 3 months ago

In each row of the dataset, flexsurv will use a cumulative hazard obtained by assuming the hazard is constant between time zero and start. Individual ids are ignored. As you suggest, I don't think this is valid in general if there are time-dependent covariates.

Though it is possible that it may be valid in special cases. I have seen it suggested that it is OK for proportional hazards models, but I haven't seen the theory behind this argument. I can't access the book you refer to in my library. flexsurv uses fully parametric models, so any theory developed for Cox models and partial likelihoods will not necessarily apply.