CamDavidsonPilon / lifelines

Survival analysis in Python
lifelines.readthedocs.org
MIT License

Combining continuous and categorical variables in a formula can lead to an error at inference #1532

Closed AlexandreAbraham closed 1 year ago

AlexandreAbraham commented 1 year ago

Fitting a CoxPHFitter with a categorical variable and then predicting on a dataset that does not contain all of the categories seen during fitting fails.

The reason is that, at inference, the preprocessing is rebuilt from the formula itself instead of reusing the preprocessing established at fitting time. Note that this can also lead to bugs when the categories change but their cardinality stays the same, since the encoded columns then match in number but not in meaning.
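To make the mechanism concrete, here is a minimal plain-Python sketch of treatment (dummy) coding. It is not lifelines or formulaic code: the helper `dummy_columns` and the `categ[T.x]` column names are hypothetical, chosen only to illustrate what happens when category levels are inferred from the inference data rather than remembered from fitting.

```python
def dummy_columns(values, categories=None):
    """Treatment-code `values`, using the first level as the reference.

    If `categories` is None, the levels are inferred from `values`
    (the behaviour described in this report); otherwise the supplied
    fit-time levels are used.
    """
    levels = sorted(set(values)) if categories is None else list(categories)
    names = [f"categ[T.{lvl}]" for lvl in levels[1:]]
    rows = [[int(v == lvl) for lvl in levels[1:]] for v in values]
    return names, rows


train = ["A", "A", "B", "B", "C", "C"]
train_names, _ = dummy_columns(train)

# Case 1: a category is missing at inference, so a column disappears.
subset = train[:4]
inferred_names, _ = dummy_columns(subset)

# Case 2: same cardinality, different categories. The shapes match,
# but the second column no longer means what it meant at fitting time.
swapped_names, _ = dummy_columns(["A", "B", "D"])

# Reusing the fit-time levels keeps the columns stable.
fit_levels = sorted(set(train))
fixed_names, fixed_rows = dummy_columns(subset, categories=fit_levels)

print(train_names)     # ['categ[T.B]', 'categ[T.C]']
print(inferred_names)  # ['categ[T.B]']
print(swapped_names)   # ['categ[T.B]', 'categ[T.D]']
print(fixed_names)     # ['categ[T.B]', 'categ[T.C]']
```

Case 1 corresponds to the crash in this report. Case 2 is the quieter variant: nothing fails, but the fitted coefficient for C would be applied to D.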

How to reproduce:

from lifelines import CoxPHFitter
import pandas as pd

cph = CoxPHFitter()
# Create a dummy dataset with one continuous and one categorical feature
df = pd.DataFrame({
    'time': [1, 2, 3, 1, 2, 3],
    'event': [0, 1, 1, 1, 0, 0],
    'cov_cont': [0.1, 0.2, 0.3, 0.1, 0.2, 0.3],
    'cov_categ': ['A', 'A', 'B', 'B', 'C', 'C'],
})
cph.fit(df, "time", "event", formula="cov_cont + C(cov_categ)")
# The following works as expected
cph.predict_survival_function(df)
# The following crashes because category C is missing at inference
cph.predict_survival_function(df.iloc[:4])

The solution is to rely not on the formula itself but on the model spec that formulaic generates at fitting time. I will provide a PR along with this bug report.