kmer-based M2 PoN

The simplest idea is to take the kmer (k = 5, 7, 10?) of reference sequence centered at each variant position and, for each distinct kmer, fit a distribution (beta distribution?) to the artifact allele fractions observed in that context.
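A minimal sketch of what the per-kmer fit could look like, assuming the observations arrive as (position, allele fraction) pairs against a reference string. The function names, the method-of-moments fit, and the `min_count` cutoff are all illustrative assumptions, not anything that exists in M2; a real implementation would more likely fit by maximum likelihood on alt/ref counts.

```python
from collections import defaultdict
from statistics import mean, variance
from typing import Dict, Iterable, List, Optional, Tuple


def kmer_at(reference: str, pos: int, k: int) -> Optional[str]:
    """Return the k-mer around 0-based `pos`, or None if it runs off the ends.

    For odd k the k-mer is exactly centered; for even k there is one extra
    base on the left (an arbitrary choice made for this sketch).
    """
    start = pos - k // 2
    end = start + k
    if start < 0 or end > len(reference):
        return None
    return reference[start:end]


def fit_beta_moments(fractions: List[float]) -> Tuple[float, float]:
    """Method-of-moments estimate of (alpha, beta) from fractions in (0, 1)."""
    m = mean(fractions)
    v = variance(fractions)  # sample variance, needs at least 2 values
    if v <= 0 or v >= m * (1 - m):
        raise ValueError("moments are not compatible with a beta distribution")
    common = m * (1 - m) / v - 1
    return m * common, (1 - m) * common


def fit_per_kmer(
    reference: str,
    observations: Iterable[Tuple[int, float]],
    k: int = 5,
    min_count: int = 2,
) -> Dict[str, Tuple[float, float]]:
    """Group artifact allele fractions by k-mer context and fit one beta each."""
    by_kmer: Dict[str, List[float]] = defaultdict(list)
    for pos, allele_fraction in observations:
        kmer = kmer_at(reference, pos, k)
        if kmer is not None:
            by_kmer[kmer].append(allele_fraction)
    return {
        kmer: fit_beta_moments(afs)
        for kmer, afs in by_kmer.items()
        if len(afs) >= min_count
    }
```

Usage would be something like `fit_per_kmer(ref_seq, [(pos, af), ...], k=7)`, which returns a dict from kmer to its fitted `(alpha, beta)`.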

Back of the envelope: with k = 10 there are 4^10 ≈ 1 million distinct kmers, so in a ~3 Gb genome each kmer appears ~3000 times on average, or about 1 million times across our panel of normals. That is easily enough to fit the distribution of artifact fractions very precisely.
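The same estimate for the other candidate values of k, as a quick sketch. The genome size of 3 × 10^9 bases and the panel size of 350 normals are assumptions; the panel size in particular is only back-solved from the "~3000 per genome" and "about 1 million" figures above, not a stated number.

```python
GENOME_SIZE = 3_000_000_000  # assumed approximate human genome length in bases
PANEL_SIZE = 350             # hypothetical number of normals, back-solved from the estimate


def kmer_budget(k: int, genome_size: int = GENOME_SIZE, panel_size: int = PANEL_SIZE):
    """Return (distinct k-mers, mean occurrences per genome, mean occurrences in panel)."""
    n_kmers = 4 ** k
    per_genome = genome_size / n_kmers
    per_panel = per_genome * panel_size
    return n_kmers, per_genome, per_panel


if __name__ == "__main__":
    for k in (5, 7, 10):
        n_kmers, per_genome, per_panel = kmer_budget(k)
        print(f"k={k:2d}  kmers={n_kmers:>9,d}  per genome={per_genome:>12,.0f}  "
              f"per panel={per_panel:>14,.0f}")
```

These are averages over all genomic positions, assuming kmers are uniformly represented; the number of sites that actually show an artifact in each context would be smaller.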

Besides a plain beta distribution, we may wish to fit other distributions for artifact allele fractions, such as a mixture of a no-artifact component (nothing beyond the base errors expected from the base quals) and a beta.
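One way the mixture could be written down, as a sketch: the no-artifact component is a binomial at a base error rate, and the artifact component is a beta-binomial (a beta allele fraction integrated over read sampling). Using a single `error_rate` in place of per-read base quals, and all the function and parameter names, are simplifying assumptions made here for illustration.

```python
import math


def log_beta_fn(a: float, b: float) -> float:
    """Log of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def binomial_pmf(alt: int, depth: int, p: float) -> float:
    """P(alt reads | depth) when each read is alt independently with probability p."""
    return math.comb(depth, alt) * p ** alt * (1 - p) ** (depth - alt)


def beta_binomial_pmf(alt: int, depth: int, a: float, b: float) -> float:
    """P(alt reads | depth) when the allele fraction is drawn from Beta(a, b)."""
    log_ratio = log_beta_fn(alt + a, depth - alt + b) - log_beta_fn(a, b)
    return math.comb(depth, alt) * math.exp(log_ratio)


def mixture_pmf(alt, depth, weight, a, b, error_rate=1e-3):
    """Mixture likelihood: (1 - weight) * no-artifact + weight * beta artifact."""
    return ((1 - weight) * binomial_pmf(alt, depth, error_rate)
            + weight * beta_binomial_pmf(alt, depth, a, b))


def artifact_posterior(alt, depth, weight, a, b, error_rate=1e-3):
    """Posterior probability that the site belongs to the artifact component."""
    artifact = weight * beta_binomial_pmf(alt, depth, a, b)
    clean = (1 - weight) * binomial_pmf(alt, depth, error_rate)
    return artifact / (artifact + clean)
```

Per kmer, the parameters to fit would then be `weight`, `a`, and `b`, for example by EM or direct maximization of the summed log of `mixture_pmf` over sites.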

broadinstitute / gatk-protected

kmer-based M2 PoN #994