CamDavidsonPilon / lifelines

Survival analysis in Python
lifelines.readthedocs.org
MIT License

Prediction for time-varying covariates #38

Closed CamDavidsonPilon closed 7 years ago

CamDavidsonPilon commented 10 years ago

It's not completely clear to me how to do prediction with time-varying covariates. By prediction, I mean constructing a hazard curve given the covariates. Consider the static case:

$\mathrm{hz}(t) = b_1(t) X_1 + b_2(t) X_2 + \dots$

This works because I can extend $t$ as far as I want and still only need to know $(X_1, X_2, \dots)$. For time-varying covariates, I can only extend as far as the observed covariates.
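To spell out why the horizon differs, here is a sketch in terms of the cumulative hazard. The symbols $H(t)$ and $B_j(t) = \int_0^t b_j(s)\,ds$ are notation introduced here for illustration, not taken from the thread.

```latex
\begin{align*}
\text{static:} \quad H(t) &= \int_0^t \mathrm{hz}(s)\,ds
  = \sum_j X_j \int_0^t b_j(s)\,ds
  = \sum_j B_j(t)\,X_j \\
\text{time-varying:} \quad H(t) &= \sum_j \int_0^t b_j(s)\,X_j(s)\,ds
\end{align*}
```

In the static case the $X_j$ factor out of the integral, so $H(t)$ can be evaluated at any $t$ covered by the fitted $B_j(t)$. In the time-varying case the integrand needs $X_j(s)$ for every $s \le t$, so the curve stops where the observed covariate path stops.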

CamDavidsonPilon commented 10 years ago

cc @BenDoyle

BenDoyle commented 10 years ago

You need to predict the covariates as well. Alternatively, you can use time-lagged covariates, so that the effect is regressed only on covariates from the past, up to a fixed lag. Then you can predict the 'future' from present covariates, out as far as the lag.
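A minimal sketch of the lagged-covariate idea, assuming long-format data with one row per subject per equally spaced integer period. The function `add_lagged` and the `purchases` column are hypothetical names for illustration, not part of lifelines.

```python
from collections import defaultdict


def add_lagged(rows, column, lag):
    """Return copies of rows with `<column>_lag<lag>` taken from `lag` periods earlier.

    rows: list of dicts with "id" and integer "period" fields (assumed layout).
    Rows without enough history get None and would be dropped before fitting.
    """
    # Index each subject's values by period so lookups do not depend on row order.
    by_subject = defaultdict(dict)
    for r in rows:
        by_subject[r["id"]][r["period"]] = r[column]

    out = []
    for r in rows:
        lagged = by_subject[r["id"]].get(r["period"] - lag)
        out.append({**r, f"{column}_lag{lag}": lagged})
    return out


# Hypothetical data: subject 1 observed for three periods, subject 2 for one.
rows = [
    {"id": 1, "period": 0, "purchases": 10},
    {"id": 1, "period": 1, "purchases": 7},
    {"id": 1, "period": 2, "purchases": 11},
    {"id": 2, "period": 1, "purchases": 5},
]
lagged_rows = add_lagged(rows, "purchases", lag=1)
```

With `lag=1`, a model regressed on `purchases_lag1` needs only covariates that are already observed in order to score one period ahead, which is the horizon the comment above describes.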

bellkev commented 10 years ago

I have a quick, naive question on the topic of this issue regarding applying either the Cox or Aalen models with time-dependent covariates.

I haven't dug deep into the details of how exactly these models handle time-dependence, but is the magnitude of the time derivative of the independent variables a factor in determining their coefficients in a given model? I ask because it seems like a lot of the literature on the subject refers to variables like blood pressure or cholesterol levels with objective healthy ranges in a medical context, where absolute values are probably very meaningful. But if these models are applied to a different use case like predicting customer attrition for a SaaS product, then it seems like the time derivative of, say, total monthly purchases could be much more meaningful than the absolute values.

Is that taken into account in these models and I'm just missing it at first glance? If so I'll gladly dig deeper to understand them a bit better. If it is not taken into account, then are there other variants of survival regression that do give more weight to the change of variables over time?
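In both the Cox and Aalen formulations, the hazard at time t depends on the covariate values supplied for time t, so a rate of change only enters if it is passed in as a covariate of its own. A minimal sketch of deriving such a column from start/stop-style rows follows; `add_rate_of_change` and the `purchases` column are hypothetical names, and treating a subject's first row as "no change" is an assumption.

```python
def add_rate_of_change(rows, column):
    """Append `<column>_rate`: change in `column` per unit time since the previous row."""
    out = []
    previous = {}  # subject id -> (start, value) of that subject's last row
    for r in sorted(rows, key=lambda row: (row["id"], row["start"])):
        prev = previous.get(r["id"])
        if prev is None:
            rate = 0.0  # assumption: no history means no observed change
        else:
            prev_start, prev_value = prev
            rate = (r[column] - prev_value) / (r["start"] - prev_start)
        out.append({**r, f"{column}_rate": rate})
        previous[r["id"]] = (r["start"], r[column])
    return out


# Hypothetical customer with three observation intervals of unequal length.
customer = [
    {"id": 1, "start": 0, "stop": 1, "purchases": 10},
    {"id": 1, "start": 1, "stop": 3, "purchases": 7},
    {"id": 1, "start": 3, "stop": 4, "purchases": 11},
]
with_rates = add_rate_of_change(customer, "purchases")
```

The model can then be fit with `purchases_rate` alongside, or instead of, the absolute level, and its coefficient measures how strongly the change is associated with the hazard.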

nitsanluke commented 5 years ago

@CamDavidsonPilon I have the same confusion about predicting hazards for new patients with time-varying covariates (for the time steps where we have covariates). According to the earlier statement, does this mean that the covariates from time step t1 are multiplied with the coefficients up to t (t <= t1), and that the covariates from time step t2 (t2 > t1) are then used for the next set of time steps?

E.g., a new patient:

| timestep | id | sex | start | stop | status | age |
|---|---|---|---|---|---|---|
| t1 | 1 | 1 | 0 | 1.000000 | 0 | 68 |
| t2 | 1 | 1 | 1 | 1.057534 | 1 | 69 |

Aalen model:

| | time | (Intercept) | sex | age |
|---|---|---|---|---|
| [1,] | 0.00000000 | 0.00000000 | 0.00000000 | 0.0000000000 |
| [2,] | 0.02739726 | -0.03219126 | 0.01176316 | 0.0003961737 |
| [3,] | 0.08219178 | -0.04721190 | 0.02449604 | 0.0004405415 |
| [4,] | 0.27123288 | -0.04871640 | 0.01581899 | 0.0007919152 |
| [5,] | 0.50684932 | -0.06037937 | 0.02904573 | 0.0007614964 |
| [6,] | 0.55890411 | -0.05073662 | 0.04344798 | 0.0002937678 |

.....

| | time | (Intercept) | sex | age |
|---|---|---|---|---|
| [67,] | 7.621918 | -0.5168784 | 0.2856985 | 0.01084973 |
| [68,] | 8.334247 | -0.5043575 | 0.2516382 | 0.01186577 |
| [69,] | 8.641096 | -0.5913812 | 0.3174582 | 0.01228551 |
| [70,] | 8.717808 | -0.6860278 | 0.3883691 | 0.01275000 |
| [71,] | 9.145205 | -0.7132204 | 0.3463793 | 0.01491289 |
| [72,] | 9.473973 | -0.7368422 | 0.3029098 | 0.01720430 |

So when building the hazard function for a patient, do we use the covariates from both time steps with their respective (time-indexed) coefficients? Also, using the last time step's time for the prediction seems unfair, as that will be the label used for evaluation.
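One way to read the question is sketched below, under two assumptions that the thread does not confirm: that the Aalen table lists cumulative coefficients, and that a patient row's covariates hold on the interval (start, stop]. The cumulative hazard is then built by multiplying each increment of the coefficients by the covariates in force at that jump time. Only the first six quoted rows are used, since the elided rows are unknown; `covariates_at` and `cumulative_hazard` are hypothetical helper names.

```python
# First six rows of the quoted table: (time, intercept, sex, age).
# Treating them as cumulative coefficients is an assumption.
CUM_COEFS = [
    (0.00000000,  0.00000000, 0.00000000, 0.0000000000),
    (0.02739726, -0.03219126, 0.01176316, 0.0003961737),
    (0.08219178, -0.04721190, 0.02449604, 0.0004405415),
    (0.27123288, -0.04871640, 0.01581899, 0.0007919152),
    (0.50684932, -0.06037937, 0.02904573, 0.0007614964),
    (0.55890411, -0.05073662, 0.04344798, 0.0002937678),
]

# The new patient's rows: (start, stop, sex, age).
PATIENT = [
    (0.0, 1.000000, 1, 68),
    (1.0, 1.057534, 1, 69),
]


def covariates_at(patient_rows, t):
    """Return [1, sex, age] in force at time t, or None if t is unobserved."""
    for start, stop, sex, age in patient_rows:
        if start < t <= stop:
            return [1.0, float(sex), float(age)]
    return None


def cumulative_hazard(cum_coefs, patient_rows):
    """Accumulate (coefficient increment) . (covariates) over the jump times."""
    curve = []
    total = 0.0
    for prev, curr in zip(cum_coefs, cum_coefs[1:]):
        t = curr[0]
        x = covariates_at(patient_rows, t)
        if x is None:  # no covariates observed at t, so stop predicting
            break
        increments = [c - p for c, p in zip(curr[1:], prev[1:])]
        total += sum(db * xj for db, xj in zip(increments, x))
        curve.append((t, total))
    return curve


curve = cumulative_hazard(CUM_COEFS, PATIENT)
```

All six jump times fall inside the patient's first interval (0, 1], so only the first row (age 68) is used here. Jump times in (1, 1.057534] would pick up the second row (age 69), and the loop stops at the first jump time with no observed covariates, which matches the limit described at the top of the thread.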