CamDavidsonPilon / lifelines

Survival analysis in Python
lifelines.readthedocs.org
MIT License
2.37k stars 560 forks

CoxPHFitter tie_method support more options #580

Open JayGuAtGitHub opened 5 years ago

JayGuAtGitHub commented 5 years ago

Do we have plans to support more options for [tie_method] in CoxPHFitter?

In fact, I tried to run an analysis on my data, but the result differs from what I got in SPSS. I also tried R and found that if I set "breslow" in something like ties=c("efron","breslow","exact"), it returns the same result as SPSS.

So I suppose [tie_method] is meant to provide the same ability. Am I right? And is there any plan to implement it?

Many thx!

CamDavidsonPilon commented 5 years ago

Hey there, this isn’t on the short term roadmap. I’m curious for your use cases for wanting other tie methods, though. From what I’ve read, Efron is fast and gives a much better approximation than Breslow. In fact, I’ve seen commentary that complains that SPSS's default is Breslow and not Efron.

pzivich commented 5 years ago

Same issue with SAS is that it defaults to Breslow, rather than Efron. R's "exact" method is a discrete time logistic model, not the exact probabilities (SAS allows for discrete and exact). It results in odds ratios, not hazard ratios.

Discrete as an option may be a nice addition to lifelines. It would allow for discrete time data with the CPH model.

JayGuAtGitHub commented 5 years ago

Many thanks for your replies, and sorry for replying so late.

For now we have decided to use [lifelines] with [Efron]. We checked some papers, and there are indeed some that say [Efron] is better.

We also tested it in R and found the results are very different. We are wondering if [lifelines] does it in [multiple variable] way?

I’m curious for your use cases for wanting other tie methods, though.

We are working with doctors and some statistics guys to build an automated data analysis system. Cox is an important node in it. In the past we did the work manually in SPSS and R, and both returned the same result. So I was wondering whether [lifelines] can do the same thing.

Same issue with SAS is that it defaults to Breslow, rather than Efron. R's "exact" method is a discrete time logistic model, not the exact probabilities (SAS allows for discrete and exact). It results in odds ratios, not hazard ratios.

Does this mean that I can't reproduce the same result in R?

CamDavidsonPilon commented 5 years ago

R (using Efron's tie method, which is the default) and lifelines should be the same, and if they are not, I would be very curious! If you see differences, please post the code you are using.

pzivich commented 5 years ago

So, SAS's Efron, R's Efron, and lifelines's Efron should all produce the same results.

SAS's Breslow and R's Breslow should produce the same results.

SAS's Discrete and R's Exact should produce the same results. Note that this method is for discrete time (not continuous time, like the other methods). Additionally, it produces odds ratios, not hazard ratios. This is a result of the partial likelihood function. This method should only be used if you are really in a discrete setting.

SAS's Exact is not available in any other software that I am aware of. Rather than using the Breslow or Efron approximations, it calculates the exact probability. It takes a lot of calculations and is time-consuming. Not much is gained by using Exact over Efron. It really is only feasible for a small number of ties.
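To make the Breslow versus Efron distinction above concrete, here is a minimal pure-Python sketch of the Cox partial log-likelihood for a single covariate. It uses the textbook formulas and is not the code of lifelines, R, or SAS; the function name partial_loglik and the toy data are made up for illustration. The two methods only differ in the denominator used for tied event times, so they agree exactly when there are no ties.

```python
import math

def partial_loglik(beta, times, events, x, ties="efron"):
    """Cox partial log-likelihood for one covariate (illustrative helper)."""
    if ties not in ("efron", "breslow"):
        raise ValueError("ties must be 'efron' or 'breslow'")
    ll = 0.0
    event_times = sorted({t for t, e in zip(times, events) if e})
    for t in event_times:
        # Everyone still under observation at time t is in the risk set.
        risk_sum = sum(math.exp(beta * xi) for ti, xi in zip(times, x) if ti >= t)
        # Subjects whose event happens exactly at t (the tied events).
        tied = [xi for ti, e, xi in zip(times, events, x) if e and ti == t]
        d = len(tied)
        tied_sum = sum(math.exp(beta * xi) for xi in tied)
        ll += beta * sum(tied)
        for k in range(d):
            # Breslow reuses the full risk set for every tied event;
            # Efron removes a fraction k/d of the tied subjects each time.
            frac = k / d if ties == "efron" else 0.0
            ll -= math.log(risk_sum - frac * tied_sum)
    return ll

# Toy data (hypothetical): 1 = event observed, 0 = censored.
events = [1, 1, 0, 1, 1]
x = [0.5, -1.0, 0.3, 1.2, -0.4]
no_ties = [1, 2, 3, 4, 5]
with_ties = [1, 2, 2, 2, 5]  # two events share time 2

for label, times in (("no ties", no_ties), ("with ties", with_ties)):
    b = partial_loglik(0.7, times, events, x, ties="breslow")
    e = partial_loglik(0.7, times, events, x, ties="efron")
    print(label, "breslow:", round(b, 4), "efron:", round(e, 4))
```

Without ties the two printed values are identical; with ties the Efron value is larger because its denominators shrink as tied subjects are removed. Maximizing each likelihood over beta would then give slightly different coefficients, which is the kind of gap seen between packages that default to different tie methods.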

JayGuAtGitHub commented 5 years ago

I think we have now found the problem, but we are still not sure whether we are using lifelines correctly.

We use load_rossi as our test data.

When I call it like this:

cph.fit(data, duration_col="week", event_col="arrest", show_progress=True)

(data is just load_rossi())

it returns:

      coef       exp(coef)  se(coef)  z         p         lower 0.95  upper 0.95
fin   -0.37942   0.684257   0.191379  -1.98256  0.047416  -0.75452    -0.00433
age   -0.05744   0.944181   0.021999  -2.61093  0.00903   -0.10055    -0.01432
race  0.3139     1.368753   0.307992  1.01918   0.308117  -0.28975    0.917554
wexp  -0.1498    0.860883   0.212225  -0.70584  0.48029   -0.56575    0.266157
mar   -0.4337    0.648105   0.381861  -1.13576  0.256057  -1.18214    0.314732
paro  -0.08487   0.918631   0.195757  -0.43355  0.664614  -0.46855    0.298805
prio  0.091498   1.095814   0.028648  3.193868  0.001404  0.035349    0.147646
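For readers comparing this output across packages, the columns can be checked by hand. The sketch below recomputes the "fin" row from its coef and se(coef) alone, assuming the usual Wald statistics under a normal approximation; the 1.959964 critical value is my assumption for a 95% interval, not something stated in the thread.

```python
import math

coef, se = -0.37942, 0.191379  # the "fin" row above

z = coef / se                                   # Wald z statistic
p = math.erfc(abs(z) / math.sqrt(2))            # two-sided p-value, standard normal
hazard_ratio = math.exp(coef)                   # the exp(coef) column
z_crit = 1.959964                               # assumed 97.5th normal percentile
lower, upper = coef - z_crit * se, coef + z_crit * se

print(f"z={z:.5f} p={p:.6f} exp(coef)={hazard_ratio:.6f}")
print(f"95% CI for coef: ({lower:.5f}, {upper:.5f})")
```

The results agree with the table up to rounding of the inputs, so the interval columns are on the coef scale; exponentiating them would give an interval for the hazard ratio.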

When I use R, I call it like this:

y<-Surv(time=coxdata$week,event=coxdata$arrest)
a<-coxph(y~age,data=coxdata,ties="efron")

it returns:

      coef       exp(coef)  se(coef)  z         p         lower 0.95  upper 0.95
age   -0.07284   0.929745   0.02079   -3.50392  0.000458  -0.11359    -0.0321

Then we tried removing all columns except age, week, and arrest and ran lifelines again, and it returned the same result!

That's why I asked:

We are wondering if [lifelines] does it in [multiple variable] way?

Seems no big issue so far. Thanks a lot for your work and this package.

One small question: how does the [multiple variable] approach really work?