Not the typical expected heterozygosity computation

tiagoantao commented 7 years ago

I was having a look at the code for expected het and I think that people normally expect it to be reported corrected for sample size, i.e multiplied by

n/(n-1)

n being the number of alleles in the sample (2 x individuals for diploids)

alimanfoo commented 7 years ago

Thanks Tiago. PR welcome if you have any time.

timothymillar commented 5 years ago

I've been looking into this and their appear to be two common variations of correction for HE.

Nei and Roychoudhury (1973) describe the method Tiago mentions:

[1] (n_alleles / (n_alleles - 1)) * uncorrected

Nei and Chesser (1983) describe a method using HO :

[2] (n_samples / (n_samples - 1)) * (uncorrected - (ho/(2*n_samples)))

It's a bit confusing because both are written with just n but the n has different meaning in each equation.

In the second paper, they say that ~~the first formula is for estimating from population frequencies but that~~ "he has not yet shown how to estimate these quantities for sample allele frequencies", They correction for the second variation incorporates the sampling of genotypes from a multinomial distribution.

(Edit: I was mistaken, the Nei and Chesser (1983) paper doesn't reference the Nei and Roychoudhury (1973) paper)

These variations can differ quite a bit for small sample sizes:

g = allel.GenotypeArray([
    [[0, 0], [0, 0], [0, 0]],
    [[0, 0], [0, 1], [1, 1]],
    [[0, 0], [1, 1], [2, 2]],
    [[0, 1], [2, 3], [4, 5]],
])

def he_1973(g):
    ac = g.count_alleles()
    af = ac.to_frequencies()
    uncorrected = allel.heterozygosity_expected(af, 2)

    n_allele = np.sum(ac, axis=-1)
    n_allele_correction = n_allele / (n_allele - 1)

    return n_allele_correction * uncorrected

def he_1987(g):
    ac = g.count_alleles()
    af = ac.to_frequencies()
    uncorrected = allel.heterozygosity_expected(af, 2)
    uncorrected = allel.heterozygosity_expected(af, 2)

    n_sample = g.count_called(axis=-1)
    n_sample_correction = n_sample / (n_sample - 1)

    ho = allel.heterozygosity_observed(g)
    ho_correction = ho / (2 * n_sample)

    return n_sample_correction * (uncorrected - ho_correction)

print('Nei 1973: HE', he_1973(g))
print('Nei 1987: HE', he_1987(g))

Nei 1973: HE [0.         0.6        0.8        1.        ]
Nei 1987: HE [0.         0.66666667 1.         1.        ]

For a larger sample the are similar:

g_large = GenotypeArray(np.tile(g, (1,1000,1)))

print('Nei 1973: HE', he_1973(g_large))
print('Nei 1987: HE', he_1987(g_large))

Nei 1973: HE [0.         0.50008335 0.6667778  0.83347225]
Nei 1987: HE [0.         0.50011115 0.66688896 0.83344448]

timothymillar commented 4 years ago

Hi @alimanfoo there are a couple of other improvements toheterozygosity_expected which I think should be considered together with this issue:

Correction for sample size

I was going to create a PR for the original issue from @tiagoantao but it cannot be implemented without an additional argument to specify n (the number of alleles) because the allele frequencies don't contain this information. Hence the functions signature would have to change to something like heterozygosity_expected(af, ploidy, n_alleles, fill=np.nan) which isn't vary elegant.

Deprecation of ploidy argument

The established literature on calculating expected heterozygosity for polyploids typically uses the probability that a pair of alleles drawn from the population are IBD. This means that the calculation of expected heterozygosity is identical for diploids and polyploids (of any ploidy). Therefore a user should specify ploidy=2 even if working with polyploids.

Using known kinship information

I'm also interested in contributing an implementation of the H_blue estimator described here which is an estimator of expected heterozygosity and incorporates a kinship matrix in order to mitigate estimation bias resulting from a sample of related individuals.

If the samples are not related (kinship=0) then H_blue is identical to the corrected H_e described by @tiagoantao.

Combining implementations

My current implementation of H_blue has the signature heterozygosity_blue(gac, kinship) where gac is an instance of GenotypeAlleleCounts. I'm using GenotypeAlleleCounts because this metric is suitable for mixed ploidy populations.

The relationship between H_blue and H_e means that they could be combined into a single function in which case the kinship-matrix would be optional: heterozygosity_blue(gac, kinship=None)

So a new implementation might be the best way to resolve all of these points, any thoughts?

alimanfoo commented 4 years ago

Hi @timothymillar, thanks a lot for following this up. This is not something I've thought deeply about so I would be more than happy to defer to your judgement here.

timothymillar commented 4 years ago

I might look into #287 before proposing anything here. It'd be much more consistent if most of these functions work on GenotypeArray and avoids duplicating functionality for the mixed ploidy case.

timothymillar commented 4 years ago

I just had a look into the FST code and realized that diversity.mean_pairwise_difference is equivalent to the heterozygosity_expected corrected by the total allele count (n/n-1) as described above.

It also treats polyploid samples correctly i.e. is equivalent to heterozygosity_expected with ploidy=2 and handles mixed ploidy assuming that sample allele counts == sample ploidy.

in short:

ac = g.count_alleles()
an = ac.sum(axis=-1)
allel.mean_pairwise_difference(ac, an) == allel.heterozygosity_expected(ac.to_frequencies(), ploidy=2) * (an / (an-1))

mean_pairwise_difference is also not subject to the bug mentioned in #274.

So the easiest solution would be to deprecate heterozygosity_expected and encourage the use of mean_pairwise_difference in it's place (and update the inbreeding_coefficient code)

cggh / scikit-allel