Interpreting Fst values

cggh / scikit-allel

A Python package for exploring and analysing genetic variation data

MIT License


Interpreting Fst values #320

Open standage opened 4 years ago

standage commented 4 years ago

Hi, thanks for making this library available!

I used the Weir/Cockerham formulation to compute F_ST values for a set of 412 markers (microhaplotypes) across 26 human population samples (from the 1000 Genomes Project). I'm now comparing these values to other measures of allelic variation I've computed previously: effective number of alleles (A_e) and Rosenberg's informativeness for assignment (I_n).
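For readers less familiar with the two comparison measures, here is a minimal sketch of how they are commonly defined. It assumes A_e = 1 / Σ p_i² for one population's haplotype frequencies, and Rosenberg's I_n computed from per-population frequencies with equal population weights and natural logs. The function names are hypothetical, and this is not necessarily the code used to produce the values discussed in this thread.

```python
import math


def effective_number_of_alleles(freqs):
    """A_e = 1 / sum(p_i^2) for one population's allele (haplotype) frequencies."""
    return 1.0 / sum(p * p for p in freqs)


def informativeness_for_assignment(pop_freqs):
    """Rosenberg's I_n for a single locus (assumed form, equal population weights).

    pop_freqs is a list of K lists; each inner list holds the allele
    frequencies of one population, with alleles in the same order.
    """
    k = len(pop_freqs)
    n_alleles = len(pop_freqs[0])
    total = 0.0
    for j in range(n_alleles):
        # Mean frequency of allele j across the K populations.
        p_mean = sum(pop[j] for pop in pop_freqs) / k
        if p_mean > 0:
            total -= p_mean * math.log(p_mean)
        for pop in pop_freqs:
            # Terms with p = 0 contribute nothing (0 * log 0 is taken as 0).
            if pop[j] > 0:
                total += (pop[j] / k) * math.log(pop[j])
    return total


# Four equally frequent haplotypes behave like four "effective" alleles.
print(effective_number_of_alleles([0.25, 0.25, 0.25, 0.25]))

# Two populations fixed for different alleles versus two identical populations.
print(informativeness_for_assignment([[1.0, 0.0], [0.0, 1.0]]))
print(informativeness_for_assignment([[0.3, 0.7], [0.3, 0.7]]))
```

With parametric frequencies this form of I_n is a mutual information, so it is zero for identical populations and grows as populations diverge.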

[Figure: microhaps-ae-in-fst]

I have a pretty good intuitive understanding of A_e and I_n, but less so for F_ST. In your documentation you note that it is possible for F_ST values to be negative. How should I interpret this? There are a handful of outliers with extremely low F_ST values: these are all on the X chromosome. Is including these in the calculation problematic?

Some more background, if interested.

standage commented 4 years ago

I refreshed my memory with Wikipedia, and the following seemed relevant.

The interpretation of F_ST can be difficult when the data analyzed are highly polymorphic. In this case, the probability of identity by descent is very low and F_ST can have an arbitrarily low upper bound, which might lead to misinterpretation of the data.

Microhaplotypes are certainly more polymorphic than SNPs. Most microhap markers are defined by 3-6 SNPs, but the markers with the highest A_e and I_n values are defined by dozens of SNPs. These most polymorphic markers (in green above) all have F_ST values near 0.
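The upper-bound effect in the Wikipedia passage can be illustrated with a small sketch. It uses Nei's G_ST = (H_T - H_S) / H_T on parametric frequencies with equal population weights, which is a simpler quantity than the Weir/Cockerham estimator, so treat it as an illustration of the idea rather than a reproduction of the values above. The function names and the toy frequencies are made up for the example.

```python
def expected_heterozygosity(freqs):
    """H = 1 - sum(p_i^2) for one set of allele frequencies."""
    return 1.0 - sum(p * p for p in freqs)


def gst(pop_freqs):
    """Nei's G_ST = (H_T - H_S) / H_T with equal population weights.

    No sample-size correction is applied; frequencies are treated as exact.
    """
    k = len(pop_freqs)
    n_alleles = len(pop_freqs[0])
    # Mean within-population heterozygosity.
    h_s = sum(expected_heterozygosity(pop) for pop in pop_freqs) / k
    # Heterozygosity of the pooled (averaged) frequencies.
    pooled = [sum(pop[j] for pop in pop_freqs) / k for j in range(n_alleles)]
    h_t = expected_heterozygosity(pooled)
    return (h_t - h_s) / h_t


# Biallelic locus, populations fixed for different alleles.
biallelic = [[1.0, 0.0], [0.0, 1.0]]

# Highly polymorphic locus: each population has 10 equally frequent
# alleles, and the two populations share no alleles at all.
pop_a = [0.1] * 10 + [0.0] * 10
pop_b = [0.0] * 10 + [0.1] * 10

print(gst(biallelic))
print(gst([pop_a, pop_b]))
```

Both loci separate the two populations perfectly, yet the polymorphic one scores close to 0 because G_ST = 1 - H_S / H_T cannot exceed 1 - H_S, and H_S is 0.9 here.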

standage commented 4 years ago

cc @rnmitchell

alimanfoo commented 4 years ago

In your documentation you note that it is possible for F_ST values to be negative. How should I interpret this?

Hi @standage, sorry for the slow response. In my limited understanding, the various Fst estimators (W&C, Hudson) can produce negative values. Negative values don't have any meaning for Fst, so they are usually clipped to zero.
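To make the point about negative estimates concrete, here is a rough sketch using Hudson's estimator for a single biallelic site, written in the form I believe is given by Bhatia et al. (2013). It is not the Weir/Cockerham calculation and not scikit-allel's implementation; `hudson_fst_site` is a hypothetical helper, and `n1`, `n2` are assumed to be the numbers of sampled chromosomes.

```python
def hudson_fst_site(p1, n1, p2, n2):
    """Hudson's Fst numerator and denominator for one biallelic site.

    p1, p2: sample allele frequencies in the two populations.
    n1, n2: number of sampled chromosomes in each population (assumed > 1).
    """
    # Squared frequency difference minus a sampling-variance correction.
    num = (p1 - p2) ** 2 - p1 * (1 - p1) / (n1 - 1) - p2 * (1 - p2) / (n2 - 1)
    den = p1 * (1 - p2) + p2 * (1 - p1)
    return num, den


# Identical sample frequencies: the correction term pushes the estimate below zero.
num, den = hudson_fst_site(0.5, 10, 0.5, 10)
raw = num / den
clipped = max(0.0, raw)  # the usual convention: treat negative estimates as zero
print(raw, clipped)

# Clearly differentiated populations give a positive estimate.
num_d, den_d = hudson_fst_site(0.9, 100, 0.1, 100)
print(num_d / den_d)
```

The negative value comes from subtracting the expected sampling noise: when the observed frequency difference is smaller than that noise, the numerator goes below zero, which is best read as "no detectable differentiation".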

My intuition for Fst is that it measures variance in allele frequencies between two populations. So in theory it shouldn't matter how many alleles are present at a locus. However, I haven't investigated how the different estimators behave in practice.

XIA1112 commented 2 years ago

Hi, I am confused about the negative values of I_n (MH locus). Have you ever encountered negative values when calculating MH