broadinstitute / gatk-protected

Obsolete/Legacy GATK repository -- go to https://github.com/broadinstitute/gatk instead
BSD 3-Clause "New" or "Revised" License

Possible enhancements to MCMC. #126

Closed samuelklee closed 7 years ago

samuelklee commented 9 years ago

Some possible enhancements/improvements, in no particular order and of varying scope:

- Change SliceSampler to handle multimodal univariate distributions. This should just be a matter of implementing the pseudocode in Neal 2003: http://projecteuclid.org/download/pdf_1/euclid.aos/1056562461

- Add a Metropolis-Hastings univariate sampler as an alternative to SliceSampler.

- Add Metropolis-Hastings/nested/etc. multivariate samplers as alternatives to GibbsSampler. This should only be tackled if a model/dataset necessitates it.

- Implement hierarchical/multilevel models in an OOP way. Currently, the samplers operate on lists of global parameters and lists of lists of "local" parameters (i.e., segment-level or site-level parameters), which is a bit clunky.

- Add convergence diagnostics (e.g., autocorrelation time).

- Add the ability to make trace plots and corner plots.

- Implement more flexible discarding of burn-in. Currently, samples from all iterations are aggregated in memory. Depending on the maximum number of iterations we want to allow, it might be better to write samples to disk, store samples in memory only after burn-in, etc., so we don't run into memory issues.

- Parallelization (again, only if a model/dataset necessitates it).
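
For the first point, here is a rough sketch of what the stepping-out and shrinkage procedures from Neal 2003 might look like in Java. This is only an illustration under assumptions: the class name `SliceSamplerSketch`, its constructor arguments, and the `next`/`sample` methods are hypothetical and are not the existing SliceSampler API. It assumes the target is supplied as an unnormalized log density that is finite at the starting point.

```java
import java.util.Random;
import java.util.function.DoubleUnaryOperator;

/** Hypothetical sketch; not the SliceSampler class from this repository. */
final class SliceSamplerSketch {
    private final DoubleUnaryOperator logPdf; // unnormalized log density of the target
    private final double width;               // initial interval width w
    private final int maxSteps;               // cap m on the number of stepping-out expansions (>= 1)
    private final Random rng;

    SliceSamplerSketch(DoubleUnaryOperator logPdf, double width, int maxSteps, long seed) {
        this.logPdf = logPdf;
        this.width = width;
        this.maxSteps = maxSteps;
        this.rng = new Random(seed);
    }

    /** Draws the next point of the chain given the current point x0 (logPdf(x0) must be finite). */
    double next(double x0) {
        // Slice level in log space: log y = log f(x0) + log U, with U ~ Uniform[0, 1).
        double logY = logPdf.applyAsDouble(x0) + Math.log(rng.nextDouble());

        // Stepping out: place an interval of width w randomly around x0,
        // then expand each end until it leaves the slice or the step budget runs out.
        double left = x0 - width * rng.nextDouble();
        double right = left + width;
        int j = (int) Math.floor(maxSteps * rng.nextDouble());
        int k = maxSteps - 1 - j;
        while (j > 0 && logY < logPdf.applyAsDouble(left)) {
            left -= width;
            j--;
        }
        while (k > 0 && logY < logPdf.applyAsDouble(right)) {
            right += width;
            k--;
        }

        // Shrinkage: propose uniformly within [left, right]; on rejection,
        // shrink the interval toward x0 and try again.
        while (true) {
            double x1 = left + rng.nextDouble() * (right - left);
            if (logY < logPdf.applyAsDouble(x1)) {
                return x1;
            }
            if (x1 < x0) {
                left = x1;
            } else {
                right = x1;
            }
        }
    }

    /** Runs the chain for numSamples iterations starting from xInitial. */
    double[] sample(double xInitial, int numSamples) {
        double[] samples = new double[numSamples];
        double x = xInitial;
        for (int i = 0; i < numSamples; i++) {
            x = next(x);
            samples[i] = x;
        }
        return samples;
    }
}
```

Whether this moves between modes in practice depends on the width: stepping out stops as soon as an end of the interval falls outside the slice, so the width probably needs to be comparable to the separation between modes for the interval to span them.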

LeeTL1220 commented 9 years ago

@samuelklee Do we need this for the beta release?

samuelklee commented 9 years ago

I'd say no to pretty much all of the points, except for whatever @davidbenjamin ends up needing to implement for the allele-fraction model (David, last time I looked at your branch there was some MH sampling going on?). Some of them will probably be relatively easy to address before beta (e.g., the first point about fixing up the SliceSampler), but I think they are low priority.

The only thing that we'll definitely have to decide on for beta release is how to store/plot the MCMC chains (i.e., the posterior samples). If all people want to see is posterior point estimates + credible intervals, we can just discard the chains, but this seems somewhat wasteful to me.
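
To make the "point estimates + credible intervals" option concrete, here is a minimal sketch of reducing a stored chain to a posterior mean and an equal-tailed credible interval after dropping burn-in. The class `PosteriorSummarySketch` and its `summarize` method are hypothetical names for illustration, not existing code in this repository, and the interpolated-quantile rule is just one reasonable choice.

```java
import java.util.Arrays;

/** Hypothetical sketch of summarizing one parameter's MCMC chain. */
final class PosteriorSummarySketch {
    final double mean;
    final double lower;
    final double upper;

    private PosteriorSummarySketch(double mean, double lower, double upper) {
        this.mean = mean;
        this.lower = lower;
        this.upper = upper;
    }

    /**
     * Discards the first numBurnIn samples, then returns the posterior mean and the
     * equal-tailed interval containing credibleMass (e.g., 0.95) of the remaining samples.
     */
    static PosteriorSummarySketch summarize(double[] chain, int numBurnIn, double credibleMass) {
        if (numBurnIn < 0 || numBurnIn >= chain.length) {
            throw new IllegalArgumentException("Burn-in must leave at least one sample.");
        }
        if (credibleMass <= 0.0 || credibleMass >= 1.0) {
            throw new IllegalArgumentException("Credible mass must be in (0, 1).");
        }
        double[] kept = Arrays.copyOfRange(chain, numBurnIn, chain.length);
        Arrays.sort(kept);
        double sum = 0.0;
        for (double v : kept) {
            sum += v;
        }
        double tail = (1.0 - credibleMass) / 2.0;
        return new PosteriorSummarySketch(
                sum / kept.length, quantile(kept, tail), quantile(kept, 1.0 - tail));
    }

    /** Quantile of sorted data by linear interpolation between neighboring order statistics. */
    private static double quantile(double[] sorted, double p) {
        double position = p * (sorted.length - 1);
        int lo = (int) Math.floor(position);
        int hi = Math.min(lo + 1, sorted.length - 1);
        double fraction = position - lo;
        return sorted[lo] + fraction * (sorted[hi] - sorted[lo]);
    }
}
```

A summary like this throws away the shape of the posterior (multimodality, correlations between parameters), which is part of why discarding the chains seems wasteful.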

droazen commented 7 years ago

Issue moved to broadinstitute/gatk #2824 via ZenHub