CamDavidsonPilon / lifelines

Survival analysis in Python
lifelines.readthedocs.org
MIT License

`robust` is ignored when `cluster_col` is set #1598

Open benslack19 opened 4 months ago

TL;DR: In CoxPHFitter.fit(), the value passed for robust has no effect when cluster_col is set. If cluster_col is specified, the Huber sandwich estimator is presumably always used.

I was using cluster_col in CoxPHFitter and saw in the docstring that the sandwich estimator is then used automatically. I was aiming to match the standard errors in a test case against a CoxTimeVaryingFitter model by setting robust to the same value in both CoxPHFitter and CoxTimeVaryingFitter. (This explains my test data below.) However, I saw from issue #544 that robust has not been implemented for CoxTimeVaryingFitter, which leaves robust=False in the CoxPHFitter model as my only option. For the test case, I can just leave cluster_col unspecified. Still, I think an error or warning should be raised when cluster_col is set and robust=False. It looks like this conditional needs to be edited.
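As a rough illustration of the check proposed above, here is a minimal sketch. `resolve_robust` is a hypothetical helper and not a lifelines function. The `None` default is also an assumption, made so that an explicit robust=False can be told apart from an unset value; the current default in CoxPHFitter.fit() is False.

```python
from typing import Optional


def resolve_robust(robust: Optional[bool] = None, cluster_col: Optional[str] = None) -> bool:
    """Hypothetical helper (not part of lifelines).

    Returns whether the sandwich estimator should be used, and rejects the
    contradictory combination of an explicit robust=False with a cluster_col.
    """
    if cluster_col is not None:
        if robust is False:
            # Explicit robust=False conflicts with clustering, so fail loudly
            # instead of silently overriding the user's choice.
            raise ValueError(
                "robust=False cannot be combined with cluster_col; "
                "clustered fits always use the sandwich estimator."
            )
        # robust is True or unset: clustering implies robust errors.
        return True
    # No clustering: honor the user's value, treating unset as False.
    return bool(robust)
```

With this sketch, `resolve_robust(cluster_col="id")` returns True, while `resolve_robust(robust=False, cluster_col="id")` raises a ValueError. A warning instead of an exception would be a gentler alternative.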

Here's a reproducible example with my comments:

```python
import numpy.testing as npt
import pandas as pd
from lifelines import CoxPHFitter, CoxTimeVaryingFitter
from lifelines.datasets import load_stanford_heart_transplants
from lifelines.utils import to_long_format

stanford = load_stanford_heart_transplants()

# Keep only the last record for each subject, drop all covariate columns except age to simplify data
stanford_last = (
    stanford.groupby("id")
    .tail(1)
    .drop(["year", "surgery", "transplant"], axis="columns")
)

# Format the data for CPH model
stanford_last_cph_wid = stanford_last.rename(
    columns={"start": "W", "stop": "T", "event": "E"}
)
stanford_last_cph_wid.head()
```

[Screenshot: output of stanford_last_cph_wid.head()]

Create a CoxPHFitter model and fit it with the cluster_col specified.

```python
cph_stanford_last_wid = CoxPHFitter()
cph_stanford_last_wid.fit(
    stanford_last_cph_wid,
    duration_col="T",
    event_col="E",
    entry_col="W",
    cluster_col="id",
)
cph_stanford_last_wid.summary
```

[Screenshot: output of cph_stanford_last_wid.summary]

However, when both cluster_col and robust are specified, the standard error is always the same (0.14374), regardless of the value of robust.

[Screenshot: model summaries fitted with cluster_col and each value of robust]

The standard error is different (0.13862) when cluster_col is not specified and robust is left at its default value of False.

[Screenshot: model summary fitted without cluster_col]

lifelines version: 0.27.8