Posterior allele frequencies

timothymillar commented 3 years ago

See #98 for background

Consider adding an optional field of length "A" (one value for each allele) which contains the dosage of each allele weighted by posterior probability of each genotype for a sample. This would be interpreted as a floating point best estimate of allele counts in each individual and could be summed over all individuals to to produce an approximation of allele frequencies in the population.

An example of calculating this over a full posterior distribution may look like this:

@njit(cache=True)
def dosage_posterior(posteriors, ploidy, n_alleles):
    n_genotypes = len(posteriors)
    dosage_posterior = np.zeros(n_alleles, dtype=np.float32)
    genotype = np.zeros(ploidy, np.int64)
    for i in range(n_genotypes):
        p = posteriors[i]
        for j in range(ploidy):
            dosage_posterior[genotype[j]] += p
        increment_genotype(genotype)
    return dosage_posterior

But this could be calculated more efficiently directly from the trace of an MCMC.

timothymillar commented 3 years ago

Probably better to as "posterior mean allele frequency" (PMAF) where the frequency of each allele across the entire MCMC trace is recorded. The sum of PMAF across all alleles is then 1 and the PMAF can be multiplied by ploidy if one needs the weighted allele counts. It would be easiest to implement this only for haplotype call and call-exact where the PMAF is guaranteed to sum to 1.

timothymillar commented 2 years ago

Best to add two options here:

AFP for allele frequency posteriors which is useful for sgkit
DS for posterior dosage of alternate alleles (drop the reference) which is what Jeff E. group are using in GWASpoly

PlantandFoodResearch / MCHap

Posterior allele frequencies #103