CamDavidsonPilon opened this issue 4 years ago
yes, with 6 degrees of freedom
hm, what do you mean "6 degrees of freedom"?
like 6 covariates to fit
I did a little experiment where I varied the sample size of my df and then fit it with Cox using clustering. Extrapolating to my ~6M-row data set, the computation time comes out to about 70 days.
```python
cph.fit(df.reset_index(), duration_col='duration', event_col='event_status',
        cluster_col='pid', show_progress=True)
```
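For reference, a minimal sketch of that kind of scaling experiment, assuming `df` has the same `duration`, `event_status`, and `pid` columns as the call above (sample sizes are illustrative):

```python
import time

from lifelines import CoxPHFitter

# Time the fit on increasing subsamples, then extrapolate to the full
# ~6M rows. Assumes `pid` lives in the index, as in the call above.
for n in [10_000, 50_000, 100_000, 500_000]:
    sample = df.sample(n=n, random_state=0).reset_index()
    cph = CoxPHFitter()
    start = time.perf_counter()
    cph.fit(sample, duration_col='duration', event_col='event_status',
            cluster_col='pid', show_progress=False)
    print(f"n={n}: {time.perf_counter() - start:.1f}s")
```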
😅
(to be clear, the above is measuring `fit`, right? and not `compute_residuals`?)
yea this is too bad. It's possible to fudge some numbers to get better performance: if you can, try binning your durations into larger buckets (i.e. rounding) - that might trigger an internal switch to another, more efficient, algorithm.
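If it helps, here's a rough sketch of that binning idea, assuming the same `df` as above; the weekly bucket width is just an example:

```python
from lifelines import CoxPHFitter

# Coarsen durations into weekly buckets; the extra ties can make
# the cheaper internal algorithm the preferred choice.
df_binned = df.copy()
df_binned['duration'] = (df_binned['duration'] // 7) * 7

cph = CoxPHFitter()
cph.fit(df_binned.reset_index(), duration_col='duration',
        event_col='event_status', cluster_col='pid')
```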
@CamDavidsonPilon Yes that's fitting.
I ended up doing a little workaround where I divvied up the data into 50k-row samples, ran those in parallel, and combined the coefficients and covariance matrices afterwards using some normal-approximation assumptions. Took about 6 hours to run. Not the most ideal, but it worked.
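A sketch of one plausible reading of that workaround, with a fixed-effect (inverse-variance) pooling standing in for the normal-approximation step; the chunking and helper names are illustrative, and ideally all rows for a given `pid` land in the same chunk:

```python
import numpy as np
from lifelines import CoxPHFitter

# Fit each ~50k-row chunk separately, then pool the estimates with
# inverse-variance weighting.
def fit_chunk(chunk):
    cph = CoxPHFitter()
    cph.fit(chunk, duration_col='duration', event_col='event_status',
            cluster_col='pid')
    return cph.params_.values, np.asarray(cph.variance_matrix_)

chunks = np.array_split(df.reset_index(), max(1, len(df) // 50_000))
results = [fit_chunk(c) for c in chunks]  # in practice, run in parallel

precisions = [np.linalg.inv(V) for _, V in results]
combined_cov = np.linalg.inv(sum(precisions))
combined_beta = combined_cov @ sum(
    P @ b for (b, _), P in zip(results, precisions)
)
```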
@ybagdasa under the hood, `CoxPHFitter` actually has two fitting algorithms, and sometimes one is much more performant than the other (depending on data shape). I may have found a bug in how the algorithms are chosen. If you're willing to retry, can you try the following kwarg on your entire dataset?
```python
CoxPHFitter().fit(...., batch_mode=True)
```
(Also, be sure to be on a modern version of lifelines, as I'm always making perf improvements)
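Spelled out against the earlier call (same columns as before), that would be:

```python
from lifelines import CoxPHFitter

# Same fit as before, but forcing the batch algorithm explicitly.
cph = CoxPHFitter()
cph.fit(df.reset_index(), duration_col='duration', event_col='event_status',
        cluster_col='pid', show_progress=True, batch_mode=True)
```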
@CamDavidsonPilon I think I tried that earlier and it didn't seem to help, but I can give it another try. Also, to be clear, I think where it's taking the most time is actually after converging, when it does whatever it does to cluster on the specified `cluster_col`.
hmmm interesting! You are probably right there, that hasn't been optimized much.
@ybagdasa, to confirm, you were computing the `schoenfeld` residuals?