broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk

Possible enhancements to MCMC. #2824

Closed: droazen closed this issue 6 years ago

droazen commented 7 years ago

@samuelklee commented on Mon Oct 05 2015

Some possible enhancements/improvements, in no particular order and of varying scope:

-Change SliceSampler to handle multimodal univariate distributions. This should just be a matter of implementing the pseudocode in Neal 2003: http://projecteuclid.org/download/pdf_1/euclid.aos/1056562461

-Add Metropolis-Hastings univariate sampler as alternative to SliceSampler.

-Add Metropolis-Hastings/nested/etc. multivariate samplers as alternatives to GibbsSampler. This should only be tackled if a model/dataset necessitates it.

-Implement hierarchical/multilevel models in an OOP way. Currently, the samplers operate on lists of global parameters and lists of lists of "local" parameters (i.e., segment-level or site-level parameters), which is a bit clunky.

-Add convergence diagnostics (e.g., autocorrelation time).

-Add ability to make trace plots and corner plots.

-Implement more flexible discarding of burn-in. Currently, samples from all iterations are aggregated in memory. Depending on the maximum number of iterations we want to allow, it might be better to write samples to disk, only store samples in memory after burn-in, etc. so we don't run into memory issues.

-Parallelization (again, only if a model/dataset necessitates it).
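
To make the first item concrete, here is a minimal Python sketch of a univariate slice sampler using the stepping-out and shrinkage procedures described in Neal 2003. It is not the GATK SliceSampler: the function names (`slice_sample`, `log_bimodal`), the parameters (`width`, `max_steps`), and the two-component test density are all illustrative assumptions.

```python
import math
import random


def slice_sample(log_density, x0, width, n_samples, max_steps=50, rng=None):
    """Univariate slice sampling with stepping-out and shrinkage.

    log_density: callable returning the log of an unnormalized density.
    x0: starting point; log_density(x0) must be finite.
    width: initial interval size used when stepping out.
    max_steps: cap on the total number of stepping-out expansions.
    """
    rng = rng or random.Random()
    samples = []
    x = x0
    for _ in range(n_samples):
        # Auxiliary height in log space: log(y) = log f(x) - Exponential(1).
        log_y = log_density(x) - rng.expovariate(1.0)

        # Stepping out: randomly position an interval of size `width`
        # around x, then expand each end until it leaves the slice.
        left = x - width * rng.random()
        right = left + width
        j = int(max_steps * rng.random())
        k = max_steps - 1 - j
        while j > 0 and log_density(left) > log_y:
            left -= width
            j -= 1
        while k > 0 and log_density(right) > log_y:
            right += width
            k -= 1

        # Shrinkage: propose uniformly in the interval and shrink it
        # toward x whenever the proposal falls outside the slice.
        while True:
            x_new = left + (right - left) * rng.random()
            if log_density(x_new) > log_y:
                x = x_new
                break
            if x_new < x:
                left = x_new
            else:
                right = x_new
        samples.append(x)
    return samples


def log_bimodal(x):
    """Hypothetical test target: equal mixture of N(-2, 1) and N(2, 1)."""
    a = -0.5 * (x - 2.0) ** 2
    b = -0.5 * (x + 2.0) ** 2
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


draws = slice_sample(log_bimodal, x0=2.0, width=2.0, n_samples=20000,
                     rng=random.Random(0))
```

The sampler moves between modes only on iterations where the slice height falls below the trough separating them, so widely separated modes will still mix slowly. The choice of `width` affects efficiency but not correctness.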

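For the convergence-diagnostics item, one common quantity is the integrated autocorrelation time, tau = 1 + 2 * sum of the lag-k autocorrelations. The sketch below estimates it for a single scalar chain. The truncation rule (stop at the first non-positive autocorrelation) is one simple choice among several, and the function name is an assumption rather than anything in GATK.

```python
import random


def integrated_autocorrelation_time(chain, max_lag=None):
    """Estimate tau = 1 + 2 * sum_k rho_k for a scalar MCMC chain.

    The sum is truncated at the first lag whose estimated
    autocorrelation is non-positive (a simple heuristic).
    """
    n = len(chain)
    if n < 2:
        raise ValueError("need at least two samples")
    mean = sum(chain) / n
    centered = [v - mean for v in chain]
    variance = sum(c * c for c in centered) / n
    if variance == 0.0:
        raise ValueError("chain is constant")
    if max_lag is None:
        max_lag = n // 2

    tau = 1.0
    for lag in range(1, max_lag + 1):
        # Biased (divide-by-n) autocovariance estimate at this lag.
        cov = sum(centered[i] * centered[i + lag] for i in range(n - lag)) / n
        rho = cov / variance
        if rho <= 0.0:
            break
        tau += 2.0 * rho
    return tau


# Illustration on a synthetic AR(1) chain, x_t = 0.9 * x_{t-1} + noise,
# whose theoretical tau is (1 + 0.9) / (1 - 0.9) = 19.
rng = random.Random(1)
x = 0.0
ar1_chain = []
for _ in range(20000):
    x = 0.9 * x + rng.gauss(0.0, 1.0)
    ar1_chain.append(x)
tau_ar1 = integrated_autocorrelation_time(ar1_chain)
```

Dividing the chain length by tau gives a rough effective sample size, which is a quick way to check whether a run is long enough.
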

@LeeTL1220 commented on Tue Nov 03 2015

@samuelklee Do we need this for the beta release?


@samuelklee commented on Tue Nov 03 2015

I'd say no to pretty much all of the points, except for whatever @davidbenjamin ends up needing to implement for the allele-fraction model (David, last time I looked at your branch there was some MH sampling going on?). Some of them will probably be relatively easy to address before beta (e.g., the first point about fixing up the SliceSampler), but I think they are low priority.

The only thing that we'll definitely have to decide on for beta release is how to store/plot the MCMC chains (i.e., the posterior samples). If all people want to see is posterior point estimates + credible intervals, we can just discard the chains, but this seems somewhat wasteful to me.
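
If only point estimates and credible intervals are kept, reducing a chain to a summary could look like the sketch below. The name `summarize_chain`, the use of the median as the point estimate, and the equal-tailed interval are illustrative assumptions, not what GATK does.

```python
import math
import statistics


def summarize_chain(samples, num_burn_in, credible_mass=0.95):
    """Discard burn-in, then return (median, lower, upper).

    The interval is equal-tailed: it cuts (1 - credible_mass) / 2 of the
    retained samples from each end, rounding outward to whole samples.
    """
    kept = sorted(samples[num_burn_in:])
    if not kept:
        raise ValueError("no samples left after discarding burn-in")
    n = len(kept)
    tail = (1.0 - credible_mass) / 2.0
    lower = kept[int(math.floor(tail * (n - 1)))]
    upper = kept[int(math.ceil((1.0 - tail) * (n - 1)))]
    return statistics.median(kept), lower, upper


# Hypothetical chain: 10 burn-in values followed by 0.0, 1.0, ..., 100.0.
chain = [1000.0] * 10 + [float(i) for i in range(101)]
median, lower, upper = summarize_chain(chain, num_burn_in=10)
```

Once a chain is reduced this way, trace plots, corner plots, and autocorrelation diagnostics can no longer be produced from it, which is the trade-off behind the storage decision.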

samuelklee commented 6 years ago

Now that Python is in the mix, we should just use MCMC packages that are already out there.