chjackson / flexsurv

The flexsurv R package for flexible parametric survival and multi-state modelling
http://chjackson.github.io/flexsurv/

faster bootstrap iterations for large data? #168

Open markdanese opened 11 months ago

markdanese commented 11 months ago

Is there an option or a way to run a multi-threaded version of the bootstrap? I am using standsurv and the data are fairly large (280,000 people). Each bootstrap iteration takes about a minute and a half, so running a few hundred is expensive. The delta method ran for over an hour before I stopped it. I am happy to run multiple threads myself and pool the iterations, but it seems that only the summary statistics (i.e., the 95% CI) are returned, not the actual results for each iteration. Is there a creative way to work around this that I am not seeing?

chjackson commented 11 months ago

flexsurv::bootci.fmsm can do multicore bootstrapping for any user-defined output from a flexsurv model, but there is no multicore feature in standsurv (copied to @mikesweeting as the author of this function).
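
A minimal sketch of this route, for readers who want to try it. It assumes `bootci.fmsm` accepts the `B`, `fn`, `cores` and `sample` arguments described on its help page, and that `fn` takes the fitted model as a first argument named `x`; please check `?bootci.fmsm` for the installed version. The `ovarian` data, the Weibull model and the `avg_surv` helper are illustrative choices that do not come from this thread, and averaging predicted survival over the observed covariates is only a rough stand-in for what standsurv computes.

```r
library(survival)  # provides the small `ovarian` example data
library(flexsurv)

# Small model, used only so the sketch runs quickly.
fit <- flexsurvreg(Surv(futime, fustat) ~ age, data = ovarian, dist = "weibull")

# Hypothetical user-defined output: predicted survival at two time points,
# averaged over the covariate values observed in the data.
# The first argument is named `x`, as bootci.fmsm is assumed to require.
avg_surv <- function(x, times = c(365, 730)) {
  s <- summary(x, newdata = ovarian, type = "survival", t = times,
               ci = FALSE, tidy = TRUE)
  vapply(times, function(tt) mean(s$est[s$time == tt]), numeric(1))
}

# Assumption: multicore execution relies on forking, so fall back to a
# single core on Windows.
n_cores <- if (.Platform$OS.type == "windows") 1 else 2

set.seed(1)
# Interval limits for each element of avg_surv()'s output.
ci <- bootci.fmsm(fit, B = 200, fn = avg_surv, cores = n_cores)

# Assumed behaviour of sample = TRUE: return the simulated values themselves,
# so that draws from separate runs can be pooled by hand.
draws <- bootci.fmsm(fit, B = 200, fn = avg_surv, cores = n_cores,
                     sample = TRUE)
```

If `sample = TRUE` behaves as assumed, it covers the request for per-iteration results rather than only the 95% CI.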

mikesweeting commented 11 months ago

Hi both. standsurv follows the way bootstrapping is done in summary.flexsurvreg which uses the normbootfn.flexsurvreg function. This doesn't seem to have the same multicore functionality that bootci.fmsm does.

@chjackson; could you perhaps explain the difference between these two functions? If you think we should switch to using bootci.fmsm instead I could look into implementing this.

More generally I've been considering whether standsurv should use bootstrapping as the default (rather than the delta method) as this would match summary.flexsurvreg. Thoughts?

chjackson commented 11 months ago

This is a bit messy unfortunately! These functions have grown organically rather than being meticulously designed.

normbootfn.flexsurvreg is not exposed to users, and is limited to bootstrapping the functions included in summary.fns. Hence it assumes the function handles the t and start arguments, which a user-supplied function might not.

bootci.fmsm was designed initially to handle multistate models (hence the name), but as a consequence it handles simple survival models too. It can deal with any function of the parameters, and it's user-visible. So it'd be more generalisable to use this instead. The only thing I can see it's missing is the rawsim feature added for doing causal contrasts in standsurv.
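
The general idea of simulating any function of the parameters, and keeping the raw per-draw results so they can be pooled across workers, can be sketched in base R. This is not flexsurv's implementation: the estimates, covariance matrix and `surv_at` function below are hypothetical placeholders, and the sketch assumes the parameter estimates are approximately multivariate normal.

```r
library(parallel)

# Hypothetical estimates and covariance of (log shape, log scale) for a
# Weibull model. In practice these would come from a fitted model.
est <- c(log_shape = 0.1, log_scale = 7.0)
covmat <- matrix(c(0.04, 0.01,
                   0.01, 0.09), nrow = 2)

# Any function of the parameters: here, Weibull survival at given times.
surv_at <- function(theta, times = c(365, 730)) {
  pweibull(times, shape = exp(theta[1]), scale = exp(theta[2]),
           lower.tail = FALSE)
}

set.seed(1)
B <- 1000

# Draw all parameter vectors up front (cheap), one draw per row, using the
# Cholesky factor of the covariance matrix.
z <- matrix(rnorm(B * length(est)), nrow = B)
draws <- sweep(z %*% chol(covmat), 2, est, "+")

# Split the draws into contiguous chunks, one per worker. mclapply forks,
# so use a single core on Windows.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L
chunks <- split(seq_len(B), ceiling(seq_len(B) * n_cores / B))

# The expensive step: evaluate the function for each draw, in parallel.
raw <- mclapply(chunks, function(idx) {
  t(apply(draws[idx, , drop = FALSE], 1, surv_at))
}, mc.cores = n_cores)

# Pool the per-draw results (B rows, one column per time point), then
# summarise them with percentile intervals.
pooled <- do.call(rbind, raw)
ci <- apply(pooled, 2, quantile, probs = c(0.025, 0.975))
```

Because the random draws are made once in the main process, the result does not depend on how the work is split across cores.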

I have an intuitive preference for the parametric bootstrap over the delta method, but that is not based on any systematic comparison. There are some mixed results comparing these approaches in this paper.
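
To make the two approaches concrete, here is a small base R sketch for one quantity: survival at a fixed time under an exponential model. The estimate, standard error and time point are hypothetical numbers chosen for illustration, and the sketch is not a systematic comparison.

```r
# Hypothetical estimate: log hazard rate of an exponential model and its SE.
theta_hat <- log(0.001)
se_theta  <- 0.2
t_eval    <- 365

# Survival at time t as a function of the log rate.
surv <- function(theta, t) exp(-exp(theta) * t)
s_hat <- surv(theta_hat, t_eval)

# Delta method: SE(S) is approximately |dS/dtheta| * SE(theta),
# where dS/dtheta = -exp(theta) * t * S(t).
grad <- -exp(theta_hat) * t_eval * s_hat
se_s <- abs(grad) * se_theta
ci_delta <- s_hat + c(-1, 1) * qnorm(0.975) * se_s

# Parametric bootstrap: draw theta from its approximate normal distribution,
# transform each draw, and take percentiles of the results.
set.seed(1)
theta_draws <- rnorm(10000, mean = theta_hat, sd = se_theta)
ci_boot <- unname(quantile(surv(theta_draws, t_eval), c(0.025, 0.975)))

rbind(delta = ci_delta, bootstrap = ci_boot)
```

The delta interval computed this way is symmetric around the estimate and is not constrained to lie in [0, 1], whereas the simulated interval inherits the range of the transformed quantity.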