CamDavidsonPilon / lifelines

Survival analysis in Python
lifelines.readthedocs.org
MIT License
2.37k stars 560 forks source link

baseline functions differ in R and Python #1146

Open Valery2511 opened 4 years ago

Valery2511 commented 4 years ago

Hello! I am translating the Cox model from R to Python and found that baseline functions in R differs from the baseline functions in Python. Based on the data from the file test.xlsx, the results are as follows: Compare_baseline_functions — копия_3

It seemed strange to me, considering that the lifelines library is based on the codes of functions from R (if I understand correctly).

At the same time, the characteristics of the Cox model and the coefficient before regressor “var_const” turned out to be the same in R and Python (coef[var_cost] = 0,05295).

Tell me please why baseline functions may differ in R and Python?

Code in R:

library("survival")
library(survminer)

#### Data download:
test_R <- read_excel("test.xlsx")

#### Data preparation:
test_R$start = test_R$months
test_R$stop = test_R$start + 1

#### Run the model:
res.cox1 <- coxph(Surv(start, stop, event)~ var_const, data = test_R, method = 'efron')
summary(res.cox1)

#### Calculate baseline cumulative hazard function:
bhest = basehaz(fit = res.cox1)
basehaz_predict <- data.frame(time = bhest$time,
                              hazard = bhest$hazard)

#### Calculate baseline survival function
cox_survival = survfit(res.cox1)
surv_predict <- data.frame(time = cox_survival$time,
                           survival = cox_survival$surv)

Code in Python:

import pandas as pd
import numpy as np
from lifelines.utils import to_long_format
from lifelines import CoxTimeVaryingFitter

#### Data download:
df = pd.read_excel('test.xlsx')

#### Data preparation:
df_model = df_model[['id', 'event', 'months', ‘var_const’]]
df_model = to_long_format(df_model, duration_col = "months")
df_model['start'] = df_model['stop']
df_model['stop'] = df_model['start']+1

#### Run the model:
ctv = CoxTimeVaryingFitter()
ctv.fit(df_model, id_col ='id', event_col='event', start_col = 'start', stop_col = 'stop')
ctv.print_summary()

#### Calculate baseline cumulative hazard function:
ctv.baseline_cumulative_hazard_

#### Calculate baseline survival function:
ctv.baseline_survival_
CamDavidsonPilon commented 4 years ago

Hi @Valery2511 At first thought, I'm not sure. I do have tests in the project comparing lifelines baseline vs R's baseline. I will have to do some digging.

Are you comfortable providing me the dataset you are using? That would help me source the difference.

Valery2511 commented 4 years ago

@CamDavidsonPilon, thank you for answering! I attached the dataset at the beginning of the question but just in case I attach it again: https://github.com/CamDavidsonPilon/lifelines/files/5305589/test.xlsx

Valery2511 commented 4 years ago

@CamDavidsonPilon, may be you will be interested, I also apply a file with the results of comparison of the baseline functions not only of R and Python but also of STATA (the dataset is the same - test.xlsx): Compare_baseline_functions_R_Python_STATA.xlsx

The most interesting thing is that all three programs give three slightly different results:)

Code in STATA:

#### Data download:
import excel "test.xlsx", sheet("Sheet1") firstrow
sort id months

#### Run the model:
stset months, id(id) failure(event)
stcox var_const, efron

#### Calculate baseline cumulative hazard function:
predict S0_hazard_cumulative, basechazard

#### Calculate baseline survival function:
predict S0_baseline_survivor, basesurv  
PortlandMichelle commented 2 years ago

Hello - I'm having a similar issue. I've fit the same model in CoxTimeVaryingFitter and R's Survival coxph function. Coefficients are the same (or similar enough) but baseline cumulative hazards are quite different. This post caught my eye and I was wondering if anything ever came of it?

CamDavidsonPilon commented 2 years ago

Hi @PortlandMichelle. Hard to say what might be going on without seeing the dataset used. Can you reproduce it with a small synthetic dataset?

PortlandMichelle commented 2 years ago

Hello - TBH I discovered the difference using a data set that I would not be able to share (due to confidentiality concerns), so I replicated it with the data provided in this post by the original poster. I find the identical results they did (in their original screen shot at the top of this post). That's why I was hoping there'd been some analysis using that data set that helped illuminate the issue back in 2020. But if you're able to do some investigation now, if we could both use the original data when this question was originally made, that would be great.