broadinstitute / gatk-protected

Obsolete/Legacy GATK repository -- go to https://github.com/broadinstitute/gatk instead
BSD 3-Clause "New" or "Revised" License

kmer-based M2 PoN #994

Closed davidbenjamin closed 7 years ago

davidbenjamin commented 7 years ago

The simplest idea is to take kmers (k = 5, 7, 10?) centered at variant positions and fit a distribution (beta distribution?) of artifact allele fractions for each kmer.
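As a rough illustration of the idea above, here is a minimal sketch that groups observed artifact allele fractions by the kmer centered on each site and fits a beta distribution per kmer. Everything in it is an assumption made for illustration: the function names, the 0-based positions, the restriction to odd k (so that "centered" is unambiguous), and the method-of-moments fit standing in for whatever fitting procedure would actually be used. It is not GATK code.

```python
from collections import defaultdict
from statistics import mean, variance
from typing import Dict, Iterable, List, Optional, Tuple


def centered_kmer(reference: str, pos: int, k: int) -> Optional[str]:
    """Return the k-mer centered at 0-based `pos`, or None if it does not fit.

    Assumption: k is odd, so the site sits exactly in the middle.
    """
    if k % 2 == 0:
        return None
    half = k // 2
    start, end = pos - half, pos + half + 1
    if start < 0 or end > len(reference):
        return None
    return reference[start:end]


def fit_beta_moments(fractions: List[float]) -> Optional[Tuple[float, float]]:
    """Method-of-moments estimate of (alpha, beta) from allele fractions.

    Returns None when there are too few points or the sample variance is
    incompatible with a beta distribution.
    """
    if len(fractions) < 2:
        return None
    m = mean(fractions)
    v = variance(fractions)  # sample variance (n - 1 denominator)
    if v <= 0 or v >= m * (1 - m):
        return None
    common = m * (1 - m) / v - 1
    return m * common, (1 - m) * common


def fit_per_kmer(
    reference: str, observations: Iterable[Tuple[int, float]], k: int
) -> Dict[str, Optional[Tuple[float, float]]]:
    """Group (position, allele_fraction) pairs by centered k-mer and fit each group."""
    grouped: Dict[str, List[float]] = defaultdict(list)
    for pos, fraction in observations:
        kmer = centered_kmer(reference, pos, k)
        if kmer is not None:
            grouped[kmer].append(fraction)
    return {kmer: fit_beta_moments(values) for kmer, values in grouped.items()}
```

For example, `fit_per_kmer("ACGTACGTA", [(4, 0.1), (4, 0.2), (4, 0.3)], 5)` groups all three fractions under the kmer `GTACG`. A maximum-likelihood fit would likely be preferable with real data; the moment estimate is used here only because it needs no external libraries.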

Back of the envelope: with k = 10 there are 4^10 ≈ 1 million distinct kmers, so in a genome of roughly 3 billion bases each kmer appears ~3000 times on average, or about 1 million times across our panel of normals. This is easily enough to fit the distribution of artifact fractions very precisely.
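The arithmetic behind this estimate can be checked in a few lines. The genome size of 3 billion bases is an assumption (a round figure for the human genome), and the panel size is not stated in this thread; the last line only shows what panel size the "1 million times" figure would imply.

```python
K = 10
GENOME_SIZE = 3_000_000_000  # assumption: round figure for the human genome
TARGET_OCCURRENCES = 1_000_000  # "about 1 million times in our panel of normals"

# Number of distinct k-mers over the alphabet {A, C, G, T}.
distinct_kmers = 4 ** K

# Average number of positions per genome at which a given k-mer occurs,
# assuming k-mers were uniformly distributed (they are not in practice).
occurrences_per_genome = GENOME_SIZE / distinct_kmers

# Panel size implied by the target; this is inferred, not given in the thread.
implied_panel_size = TARGET_OCCURRENCES / occurrences_per_genome

print(distinct_kmers)
print(round(occurrences_per_genome))
print(round(implied_panel_size))
```

Under these assumptions the counts come out to roughly a million distinct kmers, a little under 3000 occurrences per genome, and a panel of a few hundred normals. Real kmer frequencies are far from uniform, so rare kmers would have much less data than the average suggests.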

In addition to a plain beta distribution, we may wish to fit other distributions for artifact allele fractions, such as a mixture of a no-artifact component (only the base errors expected from the base quals) and a beta component.
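One possible reading of this mixture, sketched under stated assumptions: for a site with `depth` reads of which `alt_count` support the alt allele, the no-artifact component is a binomial with a fixed error rate (standing in for what the base quals predict), and the artifact component is a beta-binomial, i.e. a binomial whose allele fraction is drawn from a beta. The parameter names, the single fixed `error_rate`, and the mixing `weight` are all hypothetical choices for illustration, not something specified in this issue.

```python
import math


def log_beta_function(a: float, b: float) -> float:
    """Log of the Beta function B(a, b), computed via log-gamma."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def binomial_pmf(alt_count: int, depth: int, p: float) -> float:
    """P(alt_count | depth, p) for the no-artifact (base error only) component."""
    return math.comb(depth, alt_count) * p ** alt_count * (1 - p) ** (depth - alt_count)


def beta_binomial_pmf(alt_count: int, depth: int, a: float, b: float) -> float:
    """P(alt_count | depth) when the allele fraction is Beta(a, b) distributed."""
    log_ratio = log_beta_function(alt_count + a, depth - alt_count + b) - log_beta_function(a, b)
    return math.comb(depth, alt_count) * math.exp(log_ratio)


def mixture_likelihood(alt_count, depth, weight, error_rate, a, b):
    """Likelihood under (1 - weight) * no-artifact + weight * beta artifact."""
    no_artifact = (1 - weight) * binomial_pmf(alt_count, depth, error_rate)
    artifact = weight * beta_binomial_pmf(alt_count, depth, a, b)
    return no_artifact + artifact


def artifact_posterior(alt_count, depth, weight, error_rate, a, b):
    """Posterior probability that the site came from the artifact component."""
    artifact = weight * beta_binomial_pmf(alt_count, depth, a, b)
    return artifact / mixture_likelihood(alt_count, depth, weight, error_rate, a, b)
```

With, say, `weight=0.05`, `error_rate=0.001`, `a=1`, `b=20`, a site with 5 alt reads out of 100 is assigned almost entirely to the artifact component, while a site with 0 alt reads is assigned almost entirely to the no-artifact component. Fitting `weight`, `a`, and `b` per kmer (for example by EM) is left out of this sketch.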

droazen commented 7 years ago

Issue moved to broadinstitute/gatk #2973 via ZenHub