CamDavidsonPilon / lifelines

Survival analysis in Python
lifelines.readthedocs.org
MIT License

Performance of compute residuals #1067

Open CamDavidsonPilon opened 4 years ago

CamDavidsonPilon commented 4 years ago

@ybagdasa wrote

@CamDavidsonPilon I'm trying to use compute_residuals on a dataframe with 5M observations and after tens of minutes it is unclear whether it will ever finish computing. I suspect the dataframe is probably too large to do the computation as is in a reasonable amount of time. I'd like to avoid significantly scaling down as events constitute a small fraction of the observations and I need the statistics. Is there an existing solution for this?

@ybagdasa, to confirm, you were computing the Schoenfeld residuals?

ybagdasa commented 4 years ago

yes, with 6 degrees of freedom

CamDavidsonPilon commented 4 years ago

hm, what do you mean "6 degrees of freedom"?

ybagdasa commented 4 years ago

like 6 covariates to fit

ybagdasa commented 4 years ago

I did a little experiment where I varied the sample size of my df and then fit it with Cox using clustering. For my ~6M data set, the computation time extends to 70 days.

cph.fit(df.reset_index(), duration_col='duration', event_col='event_status', cluster_col='pid', show_progress=True)

[Two screenshots attached: 2020-06-26 15-16-34 and 2020-06-26 15-16-41]

CamDavidsonPilon commented 4 years ago

😅

(to be clear, the above is measuring fit, right? and not compute_residuals?)

yea this is too bad. It's possible to fudge some numbers to get better performance: if you can, try binning your durations into larger buckets (i.e. rounding) - that might trigger an internal switch to another, more efficient, algorithm.
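A minimal sketch of the binning idea, assuming durations are rounded up to a fixed bucket width. The helper name `bin_durations` and the width are illustrative, not part of lifelines, and rounding up rather than to the nearest bucket is a choice made here so that no duration collapses to zero. Whether a given width is coarse enough to trigger the switch is not guaranteed.

```python
import numpy as np

def bin_durations(durations, width):
    """Round each duration up to the next multiple of `width`.

    Hypothetical helper: fewer distinct durations means more tied event times.
    """
    durations = np.asarray(durations, dtype=float)
    return np.ceil(durations / width) * width

durations = np.array([0.2, 1.1, 1.9, 2.5, 7.0])
binned = bin_durations(durations, width=2.0)

print(binned)                                                    # [2. 2. 2. 4. 8.]
print(len(np.unique(durations)), "->", len(np.unique(binned)))   # 5 -> 3
```

On a dataframe this would be applied to the duration column before fitting, e.g. `df['duration'] = bin_durations(df['duration'], width)`, at the cost of some time resolution.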

ybagdasa commented 4 years ago

@CamDavidsonPilon Yes that's fitting.

I ended up doing a little workaround where I divvied up the data into 50k samples, ran those in parallel, and combined the coefficients and covariance matrices afterwards using some normal approximation assumptions. Took about 6 hours to run. Not ideal, but it worked.
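For reference, one common way to do this kind of combination is inverse-variance (precision-weighted) pooling, which treats each chunk's estimate as an independent normal draw around the same true coefficients. This is a sketch of that general technique, not necessarily the exact procedure used above; `pool_estimates` is a hypothetical helper, and the inputs are assumed to be each chunk's fitted coefficient vector and its covariance matrix.

```python
import numpy as np

def pool_estimates(betas, covs):
    """Pool per-chunk estimates by inverse-variance weighting.

    betas: list of coefficient vectors, one per chunk
    covs:  list of matching covariance matrices
    Assumes chunks are independent and each estimate is approximately normal.
    """
    precisions = [np.linalg.inv(np.asarray(c, dtype=float)) for c in covs]
    total_precision = np.sum(precisions, axis=0)
    pooled_cov = np.linalg.inv(total_precision)
    weighted_sum = np.sum(
        [p @ np.asarray(b, dtype=float) for p, b in zip(precisions, betas)],
        axis=0,
    )
    pooled_beta = pooled_cov @ weighted_sum
    return pooled_beta, pooled_cov

# Two made-up chunks with equal precision: the pooled estimate is the plain
# average and the pooled covariance is halved.
betas = [np.array([0.5, -1.0]), np.array([0.7, -0.8])]
covs = [0.04 * np.eye(2), 0.04 * np.eye(2)]
pooled_beta, pooled_cov = pool_estimates(betas, covs)
```

Note that if observations from the same cluster (here, the same `pid`) end up in different chunks, the independence assumption no longer holds, so splitting by cluster rather than by row is the safer choice.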

CamDavidsonPilon commented 4 years ago

@ybagdasa under the hood, CoxPHFitter actually has two fitting algorithms, and sometimes one is much more performant than the other (depending on data shape). I may have found a bug in how the algorithms are chosen. If you're willing to retry, can you try the following kwarg on your entire dataset?

CoxPHFitter().fit(...., batch_mode=True)

(Also, be sure to be on a modern version of lifelines, as I'm always making perf improvements)

ybagdasa commented 4 years ago

@CamDavidsonPilon I think I tried that earlier and it didn't seem to help, but I can give it another try. Also to be clear, I think where it's taking the most time is actually after converging, when it does whatever it does to cluster on the specified cluster_col.

CamDavidsonPilon commented 4 years ago

hmmm interesting! You are probably right there, that hasn't been optimized much.