mixture cure model in lifelines

SpirosFot commented 5 years ago

Hi there,

I am interested in implementing a mixture cure model with multiple events as the one implemented in Dirick. et al (2015). In their paper they create a mixture cure model for predicting default,prepayment and maturity (cure) of a loan. In that model, the unconditional survival function is given by

while the conditional survival function modelling all cases that are suspectible to default is given by a CoxPH model of the type :

I am wondering if something like this can be implemented through lifelines

CamDavidsonPilon commented 5 years ago

This can be done only if you specify what H_0 is parametrically (ideally we want non-parametric, but we can use some flexible parametric models too).

Have you seen the docs here? https://lifelines.readthedocs.io/en/latest/jupyter_notebooks/Custom%20Regression%20Models.html#Cure-models

That's important to read, for sure. Here's a first iteration of a model:

from lifelines.fitters import ParametricRegressionFitter
from autograd.scipy.special import expit
from autograd import numpy as np
from lifelines.utils.safe_exp import safe_exp
exp = safe_exp
dot = np.dot

class CompetingCureModel(ParametricRegressionFitter):

    _fitted_parameter_names = ["beta_repay", "c_repay", "c_default", "beta_default"]

    def _cumulative_hazard(self, params, T, Xs):
        p_default = expit(dot(Xs["c_default"], params["c_default"]))
        p_repay = expit(dot(Xs["c_repay"], params["c_repay"]))

        sf_default = exp(-exp(dot(Xs["beta_default"], params["beta_default"])) * T**2)
        sf_repay = exp(-exp(dot(Xs["beta_repay"], params["beta_repay"])) * T)

        return -np.log(
            p_repay * sf_repay + p_default * sf_default + (1-p_repay-p_default)
        )

swf = CompetingCureModel(penalizer=0.1)

rossi = load_rossi()
rossi["intercept"] = 1.0
rossi['week'] = rossi['week'] / 54.

covariates = {
    "beta_default": rossi.columns, 
    "beta_repay": rossi.columns, 
    "c_default": rossi.columns,
    "c_repay": rossi.columns
}

swf.fit(rossi, "week", event_col="arrest", regressors=covariates)
swf.print_summary(2)

CamDavidsonPilon commented 5 years ago

Reading the original paper, the way I model the probabilities (pi) is wrong, FYI. I think this is correct:

from lifelines.fitters import ParametricRegressionFitter
from autograd.scipy.special import expit
from autograd import numpy as np
from lifelines.utils.safe_exp import safe_exp
exp = safe_exp
dot = np.dot

class CompetingCureModel(ParametricRegressionFitter):

    _fitted_parameter_names = ["beta_repay", "c_repay", "c_default", "beta_default"]

    def _cumulative_hazard(self, params, T, Xs):
        p_default_ = exp(dot(Xs["c_default"], params["c_default"]))
        p_repay_ =   exp(dot(Xs["c_repay"], params["c_repay"]))
        p_alt_ =     1.

        p_default = p_default_ / (p_alt_ + p_default_ + p_repay_)
        p_repay = p_repay_ / (p_alt_ + p_default_ + p_repay_)
        p_alt = p_alt_ / (p_alt_ + p_default_ + p_repay_)

        # cox like hazards.
        sf_default = exp(-exp(dot(Xs["beta_default"], params["beta_default"])) * T**2)
        sf_repay = exp(-exp(dot(Xs["beta_repay"], params["beta_repay"])) * T)

        return -np.log(
            p_repay * sf_repay + p_default * sf_default + p_alt
        )

swf = CompetingCureModel(penalizer=0.001)

rossi = load_rossi()
rossi["intercept"] = 1.0
rossi['week'] = rossi['week'] / 54.

covariates = {
    "beta_default": rossi.columns,
    "beta_repay": rossi.columns,
    "c_default": rossi.columns,
    "c_repay": rossi.columns
}

swf.fit(rossi, "week", event_col="arrest", regressors=covariates, timeline=np.linspace(0, 2))
swf.print_summary(2)

SpirosFot commented 5 years ago

Hi Cameron,

Thanks very much for your answer, it was more than helpful. One thing I can't really grasp yet is why you deal with the time in the conditional survival function as follows :

sf_default = exp(-exp(dot(Xs["beta_default"], params["beta_default"])) * T**2) sf_repay = exp(-exp(dot(Xs["beta_repay"], params["beta_repay"])) * T)

You multiply the cond. survival function for default by t^2 and the cond. survival function for repayment by t. Can you explain to me why this is the case? I see that you do something really similar in your custom regression model documentation but I can't understand why. Sorry if the question is a bit silly, but I am trying to really grasp the concept of this. Thanks

CamDavidsonPilon commented 5 years ago

It's very attentive of you to notice this!

What is the `T` term representing?

Because we need a parametric model, I require some known function for the integral of h_0(t). The simplest case I can think of is integral of h_0(t) = T, and only slightly more complicated: integral of h_0(t) = T^2.

Why are the `T` terms different?

One thing I wanted to avoid is non-identifiability between the two survival curves. If both have identical structures, then the minimization function has two identical minima because of the symmetry of the problem. By slightly changing the survival functions, I "nudge" the model away from this. I could fix this by also specifying some random initial points.

CamDavidsonPilon / lifelines