cggh / scikit-allel

A Python package for exploring and analysing genetic variation data
MIT License
287 stars 49 forks source link

Use gametic heterozygosity for heterozygosity_observed calculation #277

Open timothymillar opened 5 years ago

timothymillar commented 5 years ago

Hi, thanks for scikit-allel.

It looks like the current implementation of heterozygosity_observed is suitable for diploids but not polyploids.

The current implementation treats all heterozygous genotypes as being equally heterozygous and gives them a value of 1. i.e. it counts the number of heterozygous individuals at a locus.

The "gametic heterozygosity" developed in Moody et al. (1993) (originally for tetraploids but then recommended by Meirmans et al. (2018) for all polyploids) defines heterozygosity as the proportion of hypothetical heterozygous diploid gametes formed by the alleles of a genotype.

E.g. for a population of one tetraploid:

g = allel.GenotypeArray([
    [[0, 0, 0, 0]],
    [[0, 0, 0, 1]],
    [[0, 0, 1, 1]],
    [[0, 0, 1, 2]],
    [[0, 1, 2, 3]],
])

allel.heterozygosity_observed(g)

returns array([0., 1., 1., 1., 1.])

and

gametic_heterozygosity_observed(g)

returns array([0., 0.5, 0.66666667, 0.83333333, 1.])

So only a fully heterozygous locus returns a value of 1.

For diploids there is no difference between these methods because all heterozygous loci are fully heterozygous (within the individual).

Meirmans et al. (2018) also recommend calculating expected heterozygosity as if the population was diploid regardless of it's actual ploidy (for similar reasons). This is currently possible with the scikit-allel implementation of heterozygosity_expected, however inbreeding_coefficient is hard-coded to use the ploidy of the genotype array for heterozygosity_expected so this should possibly be changed or at least made optional.

I can put together a pull request if these changes sound appropriate?

alimanfoo commented 5 years ago

Hi Tim, this sounds like an excellent improvement, I'm always keen to generalise to any ploidy where possible, please feel free to submit a PR.

On Wed, 10 Jul 2019 at 23:33, Tim Millar notifications@github.com wrote:

Hi, thanks for scikit-allel.

It looks like the current implementation of heterozygosity_observed is suitable for diploids but not polyploids.

The current implementation treats all heterozygous genotypes as being equally heterozygous and gives them a value of 1. i.e. it counts the number of heterozygous individuals at a locus.

The "gametic heterozygosity" developed in Moody et al. (1993) https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1205504/ (originally for tetraploids but then recommended by Meirmans et al. (2018) https://academic.oup.com/jhered/article/109/3/283/4827622 for all polyploids) defines heterozygosity as the proportion of hypothetical heterozygous diploid gametes formed by the alleles of a genotype.

E.g. for a population of one tetraploid:

g = allel.GenotypeArray([ [[0, 0, 0, 0]], [[0, 0, 0, 1]], [[0, 0, 1, 1]], [[0, 0, 1, 2]], [[0, 1, 2, 3]], ])

allel.heterozygosity_observed(gts)

returns array([0., 1., 1., 1., 1.])

and

gametic_heterozygosity_observed(g)

returns array([0., 0.5, 0.66666667, 0.83333333, 1.])

So only a fully heterozygous locus returns a value of 1.

For diploids there is no difference between these methods because all heterozygous loci are fully heterozygous (within the individual).

Meirmans et al. (2018) https://academic.oup.com/jhered/article/109/3/283/4827622 also recommend calculating expected heterozygosity as if the population was diploid regardless of it's actual ploidy (for similar reasons). This is currently possible with the scikit-allel implementation of heterozygosity_expected, however inbreeding_coefficient is hard-coded to use the ploidy of the genotype array for heterozygosity_expected so this should possibly be changed or at least made optional.

I can put together a pull request if these changes sound appropriate?

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/cggh/scikit-allel/issues/277?email_source=notifications&email_token=AAFLYQVSQPHWT6Z4L4RKC6DP62ZZDA5CNFSM4IAR35R2YY3PNVWWK3TUL52HS4DFUVEXG43VMWVGG33NNVSW45C7NFSM4G6QT6XQ, or mute the thread https://github.com/notifications/unsubscribe-auth/AAFLYQVSQG72GXLV3M5VY63P62ZZDANCNFSM4IAR35RQ .

--

Alistair Miles Head of Epidemiological Informatics Centre for Genomics and Global Health Big Data Institute Li Ka Shing Centre for Health Information and Discovery University of Oxford Old Road Campus Headington Oxford OX3 7LF United Kingdom Phone: +44 (0)1865 743596 or +44 (0)7866 541624 Email: alimanfoo@googlemail.com Web: http://a http://purl.org/net/alimanlimanfoo.github.io/ Twitter: @alimanfoo https://twitter.com/alimanfoo

Please feel free to resend your email and/or contact me by other means if you need an urgent reply.

timothymillar commented 5 years ago

Hi Alistair, this calculation requires allele frequencies at the sample level rather than population. I found GenotypeAlleleCountsArray which is ideal, but I can't currently see a method to transform a GenotypeArray into GenotypeAlleleCountsArray.

Would adding a method GenotypeArray.count_alleles_samples be a good solution? This would return a GenotypeAlleleCountsArray with dimentions (n_variants, n_samples, max_allele). I can add a cython implementation to opt/model.pyx to keep it fast.

With this method heterozygosity_observed(g) would be calculate something like:

g = allel.GenotypeArray([
    [[0,0,0,0], [0,0,0,0]],
    [[0,0,1,1], [0,0,1,1]],
    [[0,1,2,2], [0,1,2,2]],
])

freqs = g.count_alleles_samples().to_frequencies()
hi_uncorrected = 1 - np.power(freqs, 2).sum(axis=-1)
hi_corrected = hi_uncorrected * g.ploidy / (g.ploidy - 1) 
ho = hi_corrected.sum(axis=-1) / g.count_called(axis=-1)
ho

array([0. , 0.66666667, 0.83333333])

where g.count_alleles_samples() returns:

<GenotypeAlleleCountsArray shape=(3, 2, 3) dtype=int32>
4:0:0 4:0:0
2:2:0 2:2:0
1:1:2 1:1:2
alimanfoo commented 5 years ago

Hi Tim, there should be a to_allele_counts() method on the GenotypeArray class that converts to GenotypeAlleleCountsArray.

On Sun, 14 Jul 2019, 10:58 Tim Millar, notifications@github.com wrote:

Hi Alistair, this calculation requires allele frequencies at the sample level rather than population. I found GenotypeAlleleCountsArray which is ideal, but I can't currently see a method to transform a GenotypeArray into GenotypeAlleleCountsArray.

Would adding a method GenotypeArray.count_alleles_samples be a good solution? This would return a GenotypeAlleleCountsArray with dimentions (n_variants, n_samples, max_allele). I can add a cython implementation to opt/model.pyx to keep it fast.

With this method heterozygosity_observed(g) would be calculate something like:

g = allel.GenotypeArray([ [[0,0,0,0], [0,0,0,0]], [[0,0,1,1], [0,0,1,1]], [[0,1,2,2], [0,1,2,3]], ])

freqs = g.count_alleles_samples().to_frequencies() hi_uncorrected = 1 - np.power(freqs, 2).sum(axis=-1) hi_corrected = hi_uncorrected * g.ploidy / (g.ploidy - 1) ho = hi_corrected.sum(axis=-1) / g.count_called(axis=-1) ho

array([0. , 0.66666667, 0.86111111])

where g.count_alleles_samples() returns:

<GenotypeAlleleCountsArray shape=(3, 2, 3) dtype=int32> 4:0:0 4:0:0 2:2:0 2:2:0 1:1:2 1:1:1

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/cggh/scikit-allel/issues/277?email_source=notifications&email_token=AAFLYQSTDD72Y3D4JBEY54LP7L2EPA5CNFSM4IAR35R2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODZ4CB2Q#issuecomment-511189226, or mute the thread https://github.com/notifications/unsubscribe-auth/AAFLYQVPNPUVVJRWOA3YLKDP7L2EPANCNFSM4IAR35RQ .

alimanfoo commented 5 years ago

Sorry docs are a bit sprawling, easy to miss.

On Sun, 14 Jul 2019, 20:58 Alistair Miles, alimanfoo@googlemail.com wrote:

Hi Tim, there should be a to_allele_counts() method on the GenotypeArray class that converts to GenotypeAlleleCountsArray.

On Sun, 14 Jul 2019, 10:58 Tim Millar, notifications@github.com wrote:

Hi Alistair, this calculation requires allele frequencies at the sample level rather than population. I found GenotypeAlleleCountsArray which is ideal, but I can't currently see a method to transform a GenotypeArray into GenotypeAlleleCountsArray.

Would adding a method GenotypeArray.count_alleles_samples be a good solution? This would return a GenotypeAlleleCountsArray with dimentions (n_variants, n_samples, max_allele). I can add a cython implementation to opt/model.pyx to keep it fast.

With this method heterozygosity_observed(g) would be calculate something like:

g = allel.GenotypeArray([ [[0,0,0,0], [0,0,0,0]], [[0,0,1,1], [0,0,1,1]], [[0,1,2,2], [0,1,2,3]], ])

freqs = g.count_alleles_samples().to_frequencies() hi_uncorrected = 1 - np.power(freqs, 2).sum(axis=-1) hi_corrected = hi_uncorrected * g.ploidy / (g.ploidy - 1) ho = hi_corrected.sum(axis=-1) / g.count_called(axis=-1) ho

array([0. , 0.66666667, 0.86111111])

where g.count_alleles_samples() returns:

<GenotypeAlleleCountsArray shape=(3, 2, 3) dtype=int32> 4:0:0 4:0:0 2:2:0 2:2:0 1:1:2 1:1:1

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/cggh/scikit-allel/issues/277?email_source=notifications&email_token=AAFLYQSTDD72Y3D4JBEY54LP7L2EPA5CNFSM4IAR35R2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODZ4CB2Q#issuecomment-511189226, or mute the thread https://github.com/notifications/unsubscribe-auth/AAFLYQVPNPUVVJRWOA3YLKDP7L2EPANCNFSM4IAR35RQ .

timothymillar commented 5 years ago

I read the source for GenotypeArray, not sure how I missed it 😄

Could I please get your opinion on whether to update the existing heterozygosity function(s) or add new ones.

I'm following Hardy 2016 who describes ho and he calculations for auto-polyploids (i.e. assumes polysomic inheritance) but these are not correct for disomc inheritance unless genotypes are encoded as diploids.

In addition, updating the current heterozygosity_expected to match Hardy 2016 would involve adding the sum of individual observed heterozygosity (sum(hi)) as an additional argument which seems messy.

The other option would be to write new functions explicitly for polysomic inheritance which is probably cleaner

alimanfoo commented 5 years ago

I read the source for GenotypeArray, not sure how I missed it

It's on the Genotypes class which both GenotypeArray and GenotypeVector inherit from, seemed like a good piece of refactoring at the time but there's a few times I've had trouble finding stuff forgetting it was in the superclass.

Could I please get your opinion on whether to update the existing heterozygosity function(s) or add new ones.

I'm following Hardy 2016 who describes ho and he calculations for auto-polyploids (i.e. assumes polysomic inheritance) but these are not correct for disomc inheritance unless genotypes are encoded as diploids.

In addition, updating the current heterozygosity_expected to match Hardy 2016 would involve adding the sum of individual observed heterozygosity (sum(hi)) as an additional argument which seems messy.

The other option would be to write new functions explicitly for polysomic inheritance which is probably cleaner

  • polysomic_heterozygosity_observed
  • polysomic_heterozygosity_expected
  • polysomic_heterozygosity_individual (analogous to GenotypeArray.is_het())

I'm very happy to defer to your judgement in this, it sounds like adding new functions would be a good way to go.

timothymillar commented 5 years ago

Hi Alistair, just a note about why I decided to update the existing heterozygosity_observed rather than a new function.

I don't think there is a concern with allo-poylploids because (from my understanding) these are typically called as diploids against multiple sub-genomes. If an allo-poylploid is called using polyploid genotypes then the existing implementation would still produce an incorrect result so there is no loss of applicability.