Fitting a CoxPHFitter with a categorical variable and then predicting on a dataset that does not contain all of the categories seen during fitting fails.
The reason is that preprocessing at inference time is rebuilt from the formula itself instead of reusing the preprocessing applied at fitting time. Note that this can also lead to bugs when the categories change but their cardinality stays the same.
How to reproduce:
from lifelines import CoxPHFitter
import pandas as pd
cph = CoxPHFitter()
# Create a dummy dataset with one continuous and one categorical feature
df = pd.DataFrame({
'time': [1, 2, 3, 1, 2, 3], 'event': [0, 1, 1, 1, 0, 0],
'cov_cont':[0.1, 0.2, 0.3, 0.1, 0.2, 0.3], 'cov_categ': ['A', 'A', 'B', 'B', 'C', 'C']})
cph.fit(df, "time", "event", formula="cov_cont + C(cov_categ)")
# The following works as expected
cph.predict_survival_function(df)
# The following crashes because category C is missing at inference
cph.predict_survival_function(df.iloc[:4])
The solution is to rely not on the formula but on the model spec generated by formulaic at fitting time. I will provide a PR along with this bug report.