cggh / scikit-allel

A Python package for exploring and analysing genetic variation data
MIT License
287 stars 49 forks source link

Heterozygosity Blue #335

Open timothymillar opened 4 years ago

timothymillar commented 4 years ago

Best linear unbiased estimate of expected heterozygosity following eq. 8 in Harris and DeGiorgio (2017).

Heterozygosity_blue is a variation of heterozygosity_expected which minimizes bias due to related individuals based on a kniship matrix. This implementation makes the kinship matrix an optional argument.

If kinship is not specified then "regular" heterozygosity_expected is calculated using the (n / (n-1)) correction described in #145 (resolving that issue). This is a sensible 'default' because heterozygosity_blue is equivalent to heterozygosity_expected when the kinship matrix indicates non-related, non-inbred individuals (a matrix of zeros with a diagonal of 1/ploidy).

The implementation is a little more complex than ideal because the inverse of the kinship matrix must be calculated for only the called samples at each variant. I have tried to do this in a way that minimizes recalculation of the inverse matrix. Additionally a nan value is returned if the (sub-)kinship matrix has no inverse.

It's worth noting that heterozygosity_blue can return values outside of the interval [0, 1] with some inputs.

Related to #287 there is an optional argument to specify mixed ploidy data