There are few reasons to use cyclic coordinate descent, or any of its variations, for unconditional (unstratified) models with low numbers of covariates. The examples above, if I am reading things correctly, use 1 and 10 covariates and are unconditioned likelihoods. In such situations, almost any fitting algorithm based on least squares (or IRLS) will outperform coordinate-based methods by orders of magnitude. Cyclops is geared towards the high-dimensional (1K, 1M, 10M covariate) problems often found in observational healthcare data.
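To make the comparison concrete, here is a minimal sketch (assuming the exported createCyclopsData()/fitCyclopsModel() interface) that fits the same one-covariate, unconditional logistic regression with glm (IRLS) and with Cyclops (cyclic coordinate descent); this is exactly the low-dimensional regime where glm should win:

```r
# One-covariate, unconditional logistic regression fit two ways.
library(Cyclops)

set.seed(123)
n <- 10000
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-1 + 0.5 * x))

# IRLS via glm -- expected to dominate for 1-10 dense covariates
fitGlm <- glm(y ~ x, family = binomial())

# Cyclic coordinate descent via Cyclops
cyclopsData <- createCyclopsData(y ~ x, modelType = "lr")
fitCyclops  <- fitCyclopsModel(cyclopsData)

cbind(glm = coef(fitGlm), Cyclops = coef(fitCyclops))
```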
The current Cyclops implementation does not use BLAS-level or LINPACK operations, so your specific BLAS implementation (ATLAS, MKL, Revolution, etc.) should have little influence. Currently, Cyclops also does not implement vectorized or multicore scheduling ... this would be straightforward to accomplish. However, speed-ups here, even on outrageously expensive 40-core machines, top out at about 5- to 10-fold due to memory bandwidth limitations. There is a GPU version of the SCCS implementation; see Suchard et al. 2013; it kicks ass.
Thanks - if you don't think these are a good showcase, is there a test routine somewhere in the repo that can be scaled up and down, which I could use to demo performance to our team on my laptop? I want to give our CDRN team a preview of a comparison between the different flavors of R accelerators that have been developed for glm & cox. OHDSI, Oracle, the VA, and Revolution all have solutions that are slightly different.
Also, the examples I'm digging up in your current unit tests look like traditional multivariate glm regressions with indicator variables (e.g. the Dobson example), but nothing I would normally think of as inherently stratified (like a mixed or random-effects model with stratified intercepts). This is a weakness of most of the other R accelerators, so I would like to showcase Cyclops if it has these capabilities.
Martijn is working on a simulation framework that closely mirrors the regressions that commonly occur in large-scale cohort and self-controlled designs, i.e. 1,000s of strata (via matching, etc.), 10,000s of subjects, 1,000,000s of covariates. The framework is still half-baked, so you'll (currently) have to wade through the Simulation.R file for pointers.
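This is not Martijn's Simulation.R, but a rough, self-contained sketch of the data shape it targets (many strata, many subjects, a very sparse covariate matrix) may help size things on a laptop:

```r
# Rough illustration of the target data shape: thousands of strata,
# tens of thousands of subjects, a mostly-empty covariate matrix.
library(Matrix)

set.seed(42)
nStrata   <- 1000      # e.g. matched sets
nSubjects <- 10000
nCovars   <- 100000
density   <- 1e-4      # fraction of non-zero design entries

strata <- sample(nStrata, nSubjects, replace = TRUE)   # stratum per subject
nnz    <- round(nSubjects * nCovars * density)
X <- sparseMatrix(i = sample(nSubjects, nnz, replace = TRUE),
                  j = sample(nCovars,  nnz, replace = TRUE),
                  x = 1, dims = c(nSubjects, nCovars))

beta <- rnorm(nCovars, sd = 0.1)
rate <- exp(-2 + as.vector(X %*% beta))
y    <- rpois(nSubjects, rate)
```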
As an alternative, our industry friends may be able to provide us with an anonymized, permuted, etc., regression example that arises in standard cohort study modeled after one of the larger claims databases.
Avoiding random effects at scale is the primary motivation for fitting a conditioned model. For example, a conditional Poisson regression arises from a Poisson regression in which all subjects within a stratum share a random intercept. By conditioning on the total number of outcome events, the (very large number of) random effects fall out of the likelihood, so one does not need to estimate them directly. This certainly yields large computational savings. However, I am still very unclear what the statistical efficiency cost is. This is a great research question, and I would be very happy to co-supervise a project exploring the impact. There are arguments that conditioning is more efficient and arguments that it is less.
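To spell out the conditioning step for the shared-intercept Poisson case (the notation here is mine):

```latex
% Within stratum i, y_{ij} ~ Poisson(lambda_{ij}) with a shared intercept:
%   lambda_{ij} = exp(alpha_i + x_{ij}' beta).
% Conditioning on the stratum total n_i = sum_j y_{ij} gives a multinomial
% in which exp(alpha_i) cancels, so the alpha_i never need to be estimated:
P(y_{i1},\dots,y_{iJ} \mid n_i)
  = \binom{n_i}{y_{i1},\dots,y_{iJ}}
    \prod_j \left( \frac{e^{\alpha_i + x_{ij}^\top \beta}}
                        {\sum_k e^{\alpha_i + x_{ik}^\top \beta}} \right)^{y_{ij}}
  = \binom{n_i}{y_{i1},\dots,y_{iJ}}
    \prod_j \left( \frac{e^{x_{ij}^\top \beta}}
                        {\sum_k e^{x_{ik}^\top \beta}} \right)^{y_{ij}}.
```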
For more complicated random effects, one can parameterize these directly in the regression design matrix X. If X is sparse, then there is (almost) no memory overhead in adding, for example, an additional covariate per subject (for subject-level random effects) and estimating these coefficients along with all the main effects, assuming one uses some sort of prior (regularization). This idea works for any regression fitter (glm(), coxph(), etc.). Of course, an implementation that employs a sparse or indicator representation should work best here.
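As a hedged sketch of that idea in Cyclops (argument names follow my reading of createCyclopsData()/createPrior(); double-check against the docs): encode one indicator column per subject in the sparse design and shrink those coefficients with a normal prior while leaving the main effect unpenalized.

```r
library(Cyclops)

set.seed(7)
nSubjects <- 2000
periods   <- 3
dat <- data.frame(subjectId = factor(rep(seq_len(nSubjects), each = periods)),
                  x = rnorm(nSubjects * periods))
u <- rnorm(nSubjects, sd = 0.5)                       # "true" subject effects
dat$y <- rpois(nrow(dat), exp(-1 + 0.3 * dat$x + u[dat$subjectId]))

# Subject indicators go into the sparse/indicator block of the design.
cyclopsData <- createCyclopsData(y ~ x,
                                 indicatorFormula = ~ subjectId,
                                 data = dat, modelType = "pr")

# Normal prior (ridge-like shrinkage) on everything except the main effects.
fit <- fitCyclopsModel(cyclopsData,
                       prior = createPrior("normal", variance = 1,
                                           exclude = c("(Intercept)", "x")))
coef(fit)[c("(Intercept)", "x")]
```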
http://mran.revolutionanalytics.com
So here is the benchmarking; I haven't looped it for statistics.
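(A minimal single-run harness along these lines is enough to reproduce this kind of comparison; the sizes and calls below are placeholders rather than the exact script, and replicate()/microbenchmark would give proper statistics.)

```r
# Single-run timing of glm vs Cyclops on one simulated logistic problem.
library(Cyclops)

set.seed(2015)
n <- 1e5
p <- 50
X <- matrix(rnorm(n * p), n, p)
y <- rbinom(n, 1, plogis(X %*% rnorm(p, sd = 0.1)))
df <- data.frame(y = y, X)

timeGlm <- system.time(glm(y ~ ., data = df, family = binomial()))
timeCyclops <- system.time({
  cd <- createCyclopsData(y ~ ., data = df, modelType = "lr")
  fitCyclopsModel(cd)
})
rbind(glm = timeGlm, Cyclops = timeCyclops)
```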