CamDavidsonPilon / lifelines

Survival analysis in Python
lifelines.readthedocs.org
MIT License
2.37k stars 560 forks source link

Robust Variance for Kaplan Meier #510

Open pzivich opened 6 years ago

pzivich commented 6 years ago

Sandwich variance estimators provide correct CL coverage for inverse probability weighted data. There is a robust variance for Kaplan Meier

Instead of KaplanMeierFitter generating the warning when non-integer weights are detected, it could automatically calculate the confidence intervals based on this instead.

I have a version written what works for IPW KM data (wrote it for something separate). It would be fairly easy to add.

Furthermore, we could generalize the formula to account for user-specified groupings to account for correlated survival times. This would require some additional thought/rewriting, since I wrote the function to make individuals as their own groups (appropriate for IPW). Essentially, it would have to be more general to allow any potential grouping pattern

Potential issue: the paper I linked derives the variance for the standard Greenwood's variance estimator. It does not have the log transformation that bounds it between 0-1. I am not too confident in how to log-transform the robust variance estimator

CamDavidsonPilon commented 6 years ago

I like this idea - it will make survival curves for reoccurrences possible too.

CamDavidsonPilon commented 6 years ago

This also fits well into my roadmap of "robust errors" for lifelines https://github.com/CamDavidsonPilon/lifelines/issues/494#issuecomment-402501415

pzivich commented 6 years ago

Just to keep a record of what I have, this is the robust variance calculator I wrote for my purpose. It still has to be generalized for any group variable (it works for IDs only currently). Essentially, it needs another line to calculate the average within-groups (the actual value for the individual obs) then continue with the rest

I will try to work on it this weekend (I will also get to finishing the Aalen-Johansen estimator #413 and write up some simulation data to show why we care to use it in the presence of competing events)

def robust_var(km, survival, event_table, weights, C):
    unique_event_times = event_table['observed'].loc[event_table['observed'] >= 1].index
    zck_St_j = 0
    Stj_lag = 1
    var_j = [0]
    for j in unique_event_times:
        delta_ck = np.where(km.durations == j, 1, 0) * np.array(km.event_observed) * np.array(weights)
        gamma_ck = np.where(km.durations >= j, 1, 0) * np.array(weights)
        ri = float(event_table.loc[event_table.index == j]['n'].values)
        di = float(event_table.loc[event_table.index == j]['observed'].values)
        qi = di / ri
        Stj = float(survival.loc[survival.index == j].values)

        zck_qi = (delta_ck - qi * gamma_ck) / ri
        zck_St_j = (1 - qi) * zck_St_j + Stj_lag * zck_qi
        z_bar = np.sum(zck_St_j) / C
        var_Stj = C * np.sum((zck_St_j - z_bar) ** 2 / (C - 1))
        var_j.append(var_Stj)
        Stj_lag = Stj
    return var_j
pzivich commented 6 years ago

Also there is an easy check to make sure that the variance is calculated correctly. Following Williams RL 1995, the variance from the robust variance should reduce to Greenwoods * N / (N-1) when observations are NOT clustered

Probably useful to include in testing

pzivich commented 6 years ago

@CamDavidsonPilon what is your plan for calling robust errors? Do you plan on placing something when the class is initialized (i.e. lifelines.KaplanMeierFitter(robust_error=True))? Or when you call fit?

It would be good to keep it consistent with the Cox Model Fitters

CamDavidsonPilon commented 6 years ago

Good q: I was thinking in the call to init for both classes, but also possible to override in the fit (similar to strata for coxph)

pzivich commented 6 years ago

@CamDavidsonPilon Alright, the problems/complexity for this are coming back to me now that I am working on writing the code, that allows specific groups (it was easier to write where each individual is their own group)

I think two optional arguments will have to be added for robust variance; id which designates individual ID's and group_id which designates the group ID. For each time-step, you need to calculate a quantity z_ck(S_t(j)), meaning each individual (k) in each group (c) at time j has some quantity. As j goes forward in time, you need to use the previous z_ck(S_t(j)) to calculate the next ones. When I wrote the code for each individual IS their own group, it simplifies to z_c(S_t(j)) = z_ck(S_t(j)), making life much easier. Ultimately, I think we will need to have a column for the id variables (especially with observations divided into n-unit time-periods), so this sum can be calculated correctly. Let me know if this is alright or if you have another idea to bypass this issue.

This also raised another issue, which I need to think about more and try to find a source. That is late-entries. How should they factor into this sum? My initial thought, is since they late-enter and are known to survive until entrance, their initial z_ck(S_t(j)) value should be zero. They should not start accumulating contributions to the variance until they have been observed

pzivich commented 5 years ago

@CamDavidsonPilon I plan on trying to get back to working on this soon. I looked through the robust variance for the Cox model. It looks like robust weights are calculated by each individual, correct? Like you would not be able to specify which individuals are in a group (like GEE)? Am I correct in understanding this? If so, I will write up the robust KM to do the same process, each individual is their own group

CamDavidsonPilon commented 5 years ago

So far, you are correct that no groups are allowed (and if I understand your question, this is the cluster functionality in R).

findaz commented 4 years ago

@pzivich @CamDavidsonPilon, I think this is what I was looking for. Do you have an implementation yet? I'd be glad to try it. In the mean time, am I correct in thinking that correlated groups of observations (such as recurring events) do not effect the survival function point estimates but just effect variances? This is specifically in reference to the KM SF rather than any other method.

pzivich commented 4 years ago

Hi @findaz I don't have an implementation outside of what is presented in the function above. If I am re-calling correctly, my function written above treats each individual as their own group (i.e. it doesn't consider that individuals 1 and 2 are both in group 1). I had hit a wall for generalizing it to those scenarios. Based on the linked paper above, this variance estimator would be valid in those correlated groups as you describe.

The one issue I have come across with this approach is that I hadn't found an implementation in SAS to compare to (but upon re-looking, it looks like there is an R implementation I can compare to). I haven't come across the robust variance for the KM used in practice.

I had originally wrote the robust variance to calculate variances for IPW-KM survival functions (to avoid doing a bootstrap). However, I have learned that the robust variance does not necessarily work for all IPW estimands. The variance estimator would have to be studied more extensively (something currently above my level of understanding).