cggh / scikit-allel

A Python package for exploring and analysing genetic variation data
MIT License
289 stars 51 forks source link

Not the typical expected heterozygosity computation #145

Open tiagoantao opened 7 years ago

tiagoantao commented 7 years ago

I was having a look at the code for expected het and I think that people normally expect it to be reported corrected for sample size, i.e multiplied by

n/(n-1)

n being the number of alleles in the sample (2 x individuals for diploids)

alimanfoo commented 7 years ago

Thanks Tiago. PR welcome if you have any time.

timothymillar commented 5 years ago

I've been looking into this and their appear to be two common variations of correction for HE.

Nei and Roychoudhury (1973) describe the method Tiago mentions:

[1] (n_alleles / (n_alleles - 1)) * uncorrected

Nei and Chesser (1983) describe a method using HO :

[2] (n_samples / (n_samples - 1)) * (uncorrected - (ho/(2*n_samples)))

It's a bit confusing because both are written with just n but the n has different meaning in each equation.

In the second paper, they say that the first formula is for estimating from population frequencies but that "he has not yet shown how to estimate these quantities for sample allele frequencies", They correction for the second variation incorporates the sampling of genotypes from a multinomial distribution.

(Edit: I was mistaken, the Nei and Chesser (1983) paper doesn't reference the Nei and Roychoudhury (1973) paper)

These variations can differ quite a bit for small sample sizes:

g = allel.GenotypeArray([
    [[0, 0], [0, 0], [0, 0]],
    [[0, 0], [0, 1], [1, 1]],
    [[0, 0], [1, 1], [2, 2]],
    [[0, 1], [2, 3], [4, 5]],
])

def he_1973(g):
    ac = g.count_alleles()
    af = ac.to_frequencies()
    uncorrected = allel.heterozygosity_expected(af, 2)

    n_allele = np.sum(ac, axis=-1)
    n_allele_correction = n_allele / (n_allele - 1)

    return n_allele_correction * uncorrected

def he_1987(g):
    ac = g.count_alleles()
    af = ac.to_frequencies()
    uncorrected = allel.heterozygosity_expected(af, 2)
    uncorrected = allel.heterozygosity_expected(af, 2)

    n_sample = g.count_called(axis=-1)
    n_sample_correction = n_sample / (n_sample - 1)

    ho = allel.heterozygosity_observed(g)
    ho_correction = ho / (2 * n_sample)

    return n_sample_correction * (uncorrected - ho_correction)

print('Nei 1973: HE', he_1973(g))
print('Nei 1987: HE', he_1987(g))
Nei 1973: HE [0.         0.6        0.8        1.        ]
Nei 1987: HE [0.         0.66666667 1.         1.        ]

For a larger sample the are similar:

g_large = GenotypeArray(np.tile(g, (1,1000,1)))

print('Nei 1973: HE', he_1973(g_large))
print('Nei 1987: HE', he_1987(g_large))
Nei 1973: HE [0.         0.50008335 0.6667778  0.83347225]
Nei 1987: HE [0.         0.50011115 0.66688896 0.83344448]
timothymillar commented 4 years ago

Hi @alimanfoo there are a couple of other improvements toheterozygosity_expected which I think should be considered together with this issue:

Correction for sample size

I was going to create a PR for the original issue from @tiagoantao but it cannot be implemented without an additional argument to specify n (the number of alleles) because the allele frequencies don't contain this information. Hence the functions signature would have to change to something like heterozygosity_expected(af, ploidy, n_alleles, fill=np.nan) which isn't vary elegant.

Deprecation of ploidy argument

The established literature on calculating expected heterozygosity for polyploids typically uses the probability that a pair of alleles drawn from the population are IBD. This means that the calculation of expected heterozygosity is identical for diploids and polyploids (of any ploidy). Therefore a user should specify ploidy=2 even if working with polyploids.

Using known kinship information

I'm also interested in contributing an implementation of the H_blue estimator described here which is an estimator of expected heterozygosity and incorporates a kinship matrix in order to mitigate estimation bias resulting from a sample of related individuals.

If the samples are not related (kinship=0) then H_blue is identical to the corrected H_e described by @tiagoantao.

Combining implementations

My current implementation of H_blue has the signature heterozygosity_blue(gac, kinship) where gac is an instance of GenotypeAlleleCounts. I'm using GenotypeAlleleCounts because this metric is suitable for mixed ploidy populations.

The relationship between H_blue and H_e means that they could be combined into a single function in which case the kinship-matrix would be optional: heterozygosity_blue(gac, kinship=None)

So a new implementation might be the best way to resolve all of these points, any thoughts?

alimanfoo commented 4 years ago

Hi @timothymillar, thanks a lot for following this up. This is not something I've thought deeply about so I would be more than happy to defer to your judgement here.

timothymillar commented 4 years ago

I might look into #287 before proposing anything here. It'd be much more consistent if most of these functions work on GenotypeArray and avoids duplicating functionality for the mixed ploidy case.

timothymillar commented 4 years ago

I just had a look into the FST code and realized that diversity.mean_pairwise_difference is equivalent to the heterozygosity_expected corrected by the total allele count (n/n-1) as described above.

It also treats polyploid samples correctly i.e. is equivalent to heterozygosity_expected with ploidy=2 and handles mixed ploidy assuming that sample allele counts == sample ploidy.

in short:

ac = g.count_alleles()
an = ac.sum(axis=-1)
allel.mean_pairwise_difference(ac, an) == allel.heterozygosity_expected(ac.to_frequencies(), ploidy=2) * (an / (an-1))

mean_pairwise_difference is also not subject to the bug mentioned in #274.

So the easiest solution would be to deprecate heterozygosity_expected and encourage the use of mean_pairwise_difference in it's place (and update the inbreeding_coefficient code)