Open tiagoantao opened 7 years ago
Thanks Tiago. PR welcome if you have any time.
I've been looking into this and their appear to be two common variations of correction for HE.
Nei and Roychoudhury (1973) describe the method Tiago mentions:
[1] (n_alleles / (n_alleles - 1)) * uncorrected
Nei and Chesser (1983) describe a method using HO :
[2] (n_samples / (n_samples - 1)) * (uncorrected - (ho/(2*n_samples)))
It's a bit confusing because both are written with just n
but the n
has different meaning in each equation.
In the second paper, they say that the first formula is for estimating from population frequencies but that "he has not yet shown how to estimate these quantities for sample allele frequencies", They correction for the second variation incorporates the sampling of genotypes from a multinomial distribution.
(Edit: I was mistaken, the Nei and Chesser (1983) paper doesn't reference the Nei and Roychoudhury (1973) paper)
These variations can differ quite a bit for small sample sizes:
g = allel.GenotypeArray([
[[0, 0], [0, 0], [0, 0]],
[[0, 0], [0, 1], [1, 1]],
[[0, 0], [1, 1], [2, 2]],
[[0, 1], [2, 3], [4, 5]],
])
def he_1973(g):
ac = g.count_alleles()
af = ac.to_frequencies()
uncorrected = allel.heterozygosity_expected(af, 2)
n_allele = np.sum(ac, axis=-1)
n_allele_correction = n_allele / (n_allele - 1)
return n_allele_correction * uncorrected
def he_1987(g):
ac = g.count_alleles()
af = ac.to_frequencies()
uncorrected = allel.heterozygosity_expected(af, 2)
uncorrected = allel.heterozygosity_expected(af, 2)
n_sample = g.count_called(axis=-1)
n_sample_correction = n_sample / (n_sample - 1)
ho = allel.heterozygosity_observed(g)
ho_correction = ho / (2 * n_sample)
return n_sample_correction * (uncorrected - ho_correction)
print('Nei 1973: HE', he_1973(g))
print('Nei 1987: HE', he_1987(g))
Nei 1973: HE [0. 0.6 0.8 1. ]
Nei 1987: HE [0. 0.66666667 1. 1. ]
For a larger sample the are similar:
g_large = GenotypeArray(np.tile(g, (1,1000,1)))
print('Nei 1973: HE', he_1973(g_large))
print('Nei 1987: HE', he_1987(g_large))
Nei 1973: HE [0. 0.50008335 0.6667778 0.83347225]
Nei 1987: HE [0. 0.50011115 0.66688896 0.83344448]
Hi @alimanfoo there are a couple of other improvements toheterozygosity_expected
which I think should be considered together with this issue:
I was going to create a PR for the original issue from @tiagoantao but it cannot be implemented
without an additional argument to specify n
(the number of alleles) because the allele frequencies don't contain this information.
Hence the functions signature would have to change to something like
heterozygosity_expected(af, ploidy, n_alleles, fill=np.nan)
which isn't vary elegant.
The established literature on calculating expected heterozygosity for polyploids typically uses the probability that a pair of alleles drawn from the population are IBD.
This means that the calculation of expected heterozygosity is identical for diploids and polyploids (of any ploidy).
Therefore a user should specify ploidy=2
even if working with polyploids.
I'm also interested in contributing an implementation of the H_blue estimator described here which is an estimator of expected heterozygosity and incorporates a kinship matrix in order to mitigate estimation bias resulting from a sample of related individuals.
If the samples are not related (kinship=0) then H_blue is identical to the corrected H_e described by @tiagoantao.
My current implementation of H_blue has the signature heterozygosity_blue(gac, kinship)
where
gac
is an instance of GenotypeAlleleCounts
.
I'm using GenotypeAlleleCounts
because this metric is suitable for mixed ploidy populations.
The relationship between H_blue and H_e means that they could be combined into a single function in which case the kinship-matrix would be optional:
heterozygosity_blue(gac, kinship=None)
So a new implementation might be the best way to resolve all of these points, any thoughts?
Hi @timothymillar, thanks a lot for following this up. This is not something I've thought deeply about so I would be more than happy to defer to your judgement here.
I might look into #287 before proposing anything here. It'd be much more consistent if most of these functions work on GenotypeArray and avoids duplicating functionality for the mixed ploidy case.
I just had a look into the FST code and realized that diversity.mean_pairwise_difference
is equivalent to the heterozygosity_expected
corrected by the total allele count (n/n-1)
as described above.
It also treats polyploid samples correctly i.e. is equivalent to heterozygosity_expected
with ploidy=2
and handles mixed ploidy assuming that sample allele counts == sample ploidy.
in short:
ac = g.count_alleles()
an = ac.sum(axis=-1)
allel.mean_pairwise_difference(ac, an) == allel.heterozygosity_expected(ac.to_frequencies(), ploidy=2) * (an / (an-1))
mean_pairwise_difference
is also not subject to the bug mentioned in #274.
So the easiest solution would be to deprecate heterozygosity_expected
and encourage the use of mean_pairwise_difference
in it's place (and update the inbreeding_coefficient
code)
I was having a look at the code for expected het and I think that people normally expect it to be reported corrected for sample size, i.e multiplied by
n/(n-1)
n being the number of alleles in the sample (2 x individuals for diploids)