CamDavidsonPilon / lifelines

Survival analysis in Python
lifelines.readthedocs.org
MIT License

Why does CoxPH always underperform vs. a simple Kaplan-Meier? #1476

Closed daniyalshahzad closed 1 year ago

daniyalshahzad commented 1 year ago

For context, my company has outsourced a lifetime prediction task to a big firm. The goal is to predict customer lifetimes for each level of a categorical variable. The data includes customers from 2012 to 2022, and I have the survival times. If a customer is still with us, they are censored in 2022 (today). There is only one categorical covariate, with 6 categories (encoded as dummy variables).

Big Firm's Model:
1) Generate KM curves for each (Category + Cohort), where Cohort is (Year-month-01). For example, category_3|2015-05-01.
2) For each Category, average the (Category + Cohort) KM curves to get a single KM curve per category. The average is an exponentially weighted average that gives recent Cohorts more importance.
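The two steps above can be sketched in plain Python. This is only an illustration under assumptions the post does not state: every cohort curve is evaluated on a shared time grid, each cohort's last KM value is carried forward past its final event, and the weights are geometric (`decay ** age`). The names `km_curve`, `ew_average`, and `decay` are hypothetical.

```python
def km_curve(durations, events, grid):
    """Kaplan-Meier estimate S(t), evaluated at each time in `grid`."""
    event_times = sorted({t for t, e in zip(durations, events) if e})
    surv = 1.0
    steps = []  # (event time, S(t) just after that time)
    for t in event_times:
        at_risk = sum(1 for d in durations if d >= t)
        deaths = sum(1 for d, e in zip(durations, events) if e and d == t)
        surv *= 1.0 - deaths / at_risk
        steps.append((t, surv))
    out = []
    for g in grid:
        s = 1.0
        for t, value in steps:
            if t > g:
                break
            s = value  # step function: last value at or before g
        out.append(s)
    return out


def ew_average(curves_by_cohort, decay=0.9):
    """Exponentially weighted average of cohort curves (assumed weighting).

    Keys are ISO dates such as "2015-05-01", so sorting them is chronological.
    The most recent cohort gets weight 1, the one before it `decay`, and so on.
    """
    cohorts = sorted(curves_by_cohort)
    n = len(cohorts)
    weights = [decay ** (n - 1 - i) for i in range(n)]
    total = sum(weights)
    length = len(curves_by_cohort[cohorts[0]])
    return [
        sum(w * curves_by_cohort[c][j] for w, c in zip(weights, cohorts)) / total
        for j in range(length)
    ]
```

For one category, `curves_by_cohort` would map each cohort date to `km_curve(...)` computed on that cohort's customers only, and `ew_average` would return the single per-category curve.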

My model: I have used CPH with Breslow estimation for the baseline hazard. Covariates include the categorical covariate, cohort year, and cohort month. I did not check the PH assumption, since I am only concerned with survival prediction.

Validation: Artificial censoring. I artificially censor the data at a specific date, for example 2019-01-01, and train both models on it. Then I predict survival/retention at 2021-01-01 and compute the error, Predicted Retention - Actual Retention, for analysis.
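A minimal sketch of the artificial censoring step, assuming each customer is described by a start date and an optional end date (`None` if still active). The helper name `censor_at` and the choice of days as the time unit are assumptions for illustration, not taken from the actual pipeline.

```python
from datetime import date


def censor_at(start, end, cutoff):
    """Return (duration_days, event) as the data would look on `cutoff`.

    Returns None for customers who had not started by the cutoff.
    Churn that happens after the cutoff is hidden, so the row is censored.
    """
    if start > cutoff:
        return None
    if end is not None and end <= cutoff:
        return ((end - start).days, 1)  # churn observed before the cutoff
    return ((cutoff - start).days, 0)   # still active as far as we know


cutoff = date(2019, 1, 1)
rows = [
    censor_at(date(2018, 1, 1), date(2018, 3, 1), cutoff),  # churned early
    censor_at(date(2018, 1, 1), date(2020, 1, 1), cutoff),  # churn hidden
    censor_at(date(2018, 1, 1), None, cutoff),              # never churned
    censor_at(date(2019, 6, 1), None, cutoff),              # not yet a customer
]
training_rows = [r for r in rows if r is not None]
```

Both models are then trained on `training_rows` only, and the uncensored data is used to compute the actual retention at the evaluation date.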

Results: The KM curves consistently predict retention accurately, while CoxPH overestimates retention by a lot: Cox = .55 vs. KM = .42 vs. actual = .4. I have tried using a "sample_weights" column to give more importance to recent entries, but to no avail.

Is there any underlying issue? I can give more information if needed.

Closing this, since I was violating the common baseline hazard assumption. Even though I was only interested in the predictions, the proportional hazards assumption was important here, and the model produced poorer results until I stratified.