cggh / scikit-allel

A Python package for exploring and analysing genetic variation data
MIT License

Question: Is `allel.stats.hudson_fst()` actually unbiased following Bhatia et al. 2013? #407

Open taprs opened 5 months ago

taprs commented 5 months ago

Hi scikit-allel team and thanks for doing a good job!

While checking the implementation of Hudson's Fst estimator, I got the impression that, despite the reference to Bhatia et al. (2013), the `allel.stats.hudson_fst()` function does not account for the systematic bias of using $π_{within}$ as an estimator for $H_w$ and $π_{between}$ as an estimator for $H_b$. It just does

$$ F_{st} = { { π_{between} - π_{within} } \over { π_{between} } } $$

right? These are the naive estimators of the numerator and denominator sensu Bhatia et al. (2013); the way to correct them is shown in equation (10) of the paper (and there is a nice section on its justification and derivation in the Supplementary Materials).

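$$ \hat{F}_{st} = { { (p_1 - p_2)^2 - { p_1 (1 - p_1) \over n_1 - 1 } - { p_2 (1 - p_2) \over n_2 - 1 } } \over { p_1 (1 - p_2) + p_2 (1 - p_1) } } $$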

where $n_1$ and $n_2$ are allele counts for populations 1 and 2 and $p_1$ and $p_2$ are their allele frequencies.
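
For concreteness, the correction discussed above can be sketched in plain Python for a single biallelic variant. This is only an illustration, assuming equation (10) of Bhatia et al. (2013) has the form written in the comments below; the function name `bhatia_fst_terms` and its arguments are hypothetical and are not part of scikit-allel.

```python
def bhatia_fst_terms(p1, n1, p2, n2):
    """Per-variant numerator and denominator of the corrected estimator.

    Assumed form of equation (10) in Bhatia et al. (2013):
        num = (p1 - p2)^2 - p1(1 - p1)/(n1 - 1) - p2(1 - p2)/(n2 - 1)
        den = p1(1 - p2) + p2(1 - p1)
    p1, p2: sample allele frequencies; n1, n2: sample sizes.
    """
    # Squared frequency difference minus the two sampling-variance terms
    num = (p1 - p2) ** 2 - p1 * (1 - p1) / (n1 - 1) - p2 * (1 - p2) / (n2 - 1)
    # Probability that one allele drawn from each population differs
    den = p1 * (1 - p2) + p2 * (1 - p1)
    return num, den


# Hypothetical example values, not taken from any real dataset
num, den = bhatia_fst_terms(p1=0.5, n1=10, p2=0.2, n2=20)
print(num / den)
```

Returning the numerator and denominator separately, rather than their ratio, keeps the two terms available for whatever averaging across variants is wanted afterwards.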

Did I miss the place where the bias is eventually accounted for, or should the function be modified (or at least the note about using an unbiased estimator removed)? I can try to work on fixing the function if this is the case.

Cheers, Nikita

taprs commented 5 months ago

Update: I implemented the formula above and compared it with `hudson_fst()`, and the results are the same. Apparently I am missing something in the source code...
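
One possible explanation for the identical results, offered as an assumption rather than a statement about the scikit-allel source: if $π_{within}$ is computed as the mean pairwise difference over distinct pairs of sampled alleles (dividing by $n(n-1)/2$ rather than $n^2/2$), the sample-size correction is already built in, and $(π_{between} - π_{within}) / π_{between}$ is algebraically the same as the corrected estimator. The sketch below checks this with exact rational arithmetic for a biallelic variant. It does not call scikit-allel, and all function names are hypothetical.

```python
from fractions import Fraction


def pi_within(c, n):
    """Mean pairwise difference among n sampled alleles, c of them alternate."""
    # c * (n - c) of the n * (n - 1) / 2 distinct pairs differ
    return Fraction(2 * c * (n - c), n * (n - 1))


def pi_between(c1, n1, c2, n2):
    """Mean pairwise difference over pairs with one allele from each sample."""
    return Fraction(c1 * (n2 - c2) + (n1 - c1) * c2, n1 * n2)


def fst_from_pi(c1, n1, c2, n2):
    """(pi_between - pi_within) / pi_between, averaging the two within terms."""
    within = (pi_within(c1, n1) + pi_within(c2, n2)) / 2
    between = pi_between(c1, n1, c2, n2)
    return (between - within) / between


def fst_eq10(c1, n1, c2, n2):
    """Corrected estimator, assuming the usual form of Bhatia et al. eq. (10)."""
    p1, p2 = Fraction(c1, n1), Fraction(c2, n2)
    num = (p1 - p2) ** 2 - p1 * (1 - p1) / (n1 - 1) - p2 * (1 - p2) / (n2 - 1)
    den = p1 * (1 - p2) + p2 * (1 - p1)
    return num / den


# Compare the two forms for every combination of allele counts in two
# small samples, skipping cases where pi_between is zero (Fst undefined).
n1, n2 = 6, 10
mismatches = [
    (c1, c2)
    for c1 in range(n1 + 1)
    for c2 in range(n2 + 1)
    if pi_between(c1, n1, c2, n2) != 0
    and fst_from_pi(c1, n1, c2, n2) != fst_eq10(c1, n1, c2, n2)
]
print(mismatches)  # prints []
```

The agreement follows from the identity $p_1(1-p_2) + p_2(1-p_1) = (p_1-p_2)^2 + p_1(1-p_1) + p_2(1-p_2)$: subtracting the within terms $p_i(1-p_i)\,n_i/(n_i-1)$ leaves exactly $-p_i(1-p_i)/(n_i-1)$ for each population.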