PlantandFoodResearch / MCHap

Polyploid micro-haplotype assembly using Markov chain Monte Carlo simulation.
MIT License
18 stars 3 forks source link

Posterior allele frequencies #103

Closed timothymillar closed 2 years ago

timothymillar commented 3 years ago

See #98 for background

Consider adding an optional field of length "A" (one value for each allele) which contains the dosage of each allele weighted by posterior probability of each genotype for a sample. This would be interpreted as a floating point best estimate of allele counts in each individual and could be summed over all individuals to to produce an approximation of allele frequencies in the population.

An example of calculating this over a full posterior distribution may look like this:

@njit(cache=True)
def dosage_posterior(posteriors, ploidy, n_alleles):
    n_genotypes = len(posteriors)
    dosage_posterior = np.zeros(n_alleles, dtype=np.float32)
    genotype = np.zeros(ploidy, np.int64)
    for i in range(n_genotypes):
        p = posteriors[i]
        for j in range(ploidy):
            dosage_posterior[genotype[j]] += p
        increment_genotype(genotype)
    return dosage_posterior

But this could be calculated more efficiently directly from the trace of an MCMC.

timothymillar commented 3 years ago

Probably better to as "posterior mean allele frequency" (PMAF) where the frequency of each allele across the entire MCMC trace is recorded. The sum of PMAF across all alleles is then 1 and the PMAF can be multiplied by ploidy if one needs the weighted allele counts. It would be easiest to implement this only for haplotype call and call-exact where the PMAF is guaranteed to sum to 1.

timothymillar commented 2 years ago

Best to add two options here: