cggh / scikit-allel

A Python package for exploring and analysing genetic variation data
MIT License

Interpreting Fst values #320

Open standage opened 4 years ago

standage commented 4 years ago

Hi, thanks for making this library available!

I used the Weir/Cockerham formulation to compute FST values for a set of 412 markers (microhaplotypes) across 26 human population samples (from the 1000 Genomes Project). I'm now comparing these values to other measures of allelic variation I've computed previously: effective number of alleles (Ae) and Rosenberg's informativeness for assignment (In).

[Figure: microhaps-ae-in-fst]

I have a pretty good intuitive understanding of Ae and In, but less so for FST. In your documentation you note that it is possible for FST values to be negative. How should I interpret this? There are a handful of outliers with extremely low FST values: these are all on the X chromosome. Is including these in the calculation problematic?



standage commented 4 years ago

I refreshed my memory with Wikipedia, and the following seemed relevant:

> The interpretation of FST can be difficult when the data analyzed are highly polymorphic. In this case, the probability of identity by descent is very low and FST can have an arbitrarily low upper bound, which might lead to misinterpretation of the data.

Microhaplotypes are certainly more polymorphic than SNPs. Most microhap markers are defined by 3-6 SNPs, but the markers with the highest Ae and In values are defined by dozens of SNPs. These most polymorphic markers (in green above) all have FST values near 0.
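As a rough illustration of the quoted point about the upper bound, here is a minimal sketch using Nei's GST, (HT - HS) / HT, rather than the Weir/Cockerham estimator. The helper names (`expected_heterozygosity`, `gst`, `private_allele_pops`) are hypothetical and not part of scikit-allel. It builds two equally weighted populations that share no alleles at all, each with `k` equally frequent private alleles, and shows that GST still shrinks toward zero as `k` grows.

```python
def expected_heterozygosity(freqs):
    """Expected heterozygosity 1 - sum(p_i^2) for one set of allele frequencies."""
    return 1.0 - sum(p * p for p in freqs)


def gst(pop_freqs):
    """Nei's GST for equally weighted populations.

    pop_freqs is a list of allele-frequency lists, one per population,
    all listing the same alleles in the same order.
    """
    n_pops = len(pop_freqs)
    # HS: mean within-population expected heterozygosity.
    hs = sum(expected_heterozygosity(f) for f in pop_freqs) / n_pops
    # HT: expected heterozygosity of the pooled (mean) allele frequencies.
    mean_freqs = [sum(col) / n_pops for col in zip(*pop_freqs)]
    ht = expected_heterozygosity(mean_freqs)
    return (ht - hs) / ht


def private_allele_pops(k):
    """Two populations with k equally frequent alleles each and no shared alleles."""
    pop_a = [1.0 / k] * k + [0.0] * k
    pop_b = [0.0] * k + [1.0 / k] * k
    return [pop_a, pop_b]


for k in (1, 2, 5, 10, 50):
    print(k, round(gst(private_allele_pops(k)), 4))
```

In this toy setup GST works out to 1 / (2k - 1), so complete differentiation gives 1.0 only for a fixed difference (k = 1) and about 0.05 for k = 10. That would be consistent with the most polymorphic markers sitting near 0, although the Weir/Cockerham estimator may not behave identically.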

standage commented 4 years ago

cc @rnmitchell

alimanfoo commented 4 years ago

> In your documentation you note that it is possible for FST values to be negative. How should I interpret this?

Hi @standage, sorry for the slow response. In my limited understanding, the various Fst estimators (W&C, Hudson) can produce negative values, but negative values don't have any meaning for Fst, so they are usually clipped to zero.
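To make the negative values concrete, here is a small sketch of Hudson's estimator for a single biallelic site, written in the form given by Bhatia et al. (2013). This is an illustration under that assumption, not scikit-allel's implementation, and `hudson_fst_biallelic` is a hypothetical name. The estimator subtracts a sampling-variance correction from the squared frequency difference, so when the two sample frequencies are nearly equal the numerator drops below zero.

```python
def hudson_fst_biallelic(p1, n1, p2, n2):
    """Hudson's Fst estimator at one biallelic site.

    p1, p2: sample allele frequencies in the two populations.
    n1, n2: number of sampled chromosomes in each population (must be > 1).
    """
    # Squared frequency difference minus a correction for finite sample size.
    num = (p1 - p2) ** 2 - p1 * (1 - p1) / (n1 - 1) - p2 * (1 - p2) / (n2 - 1)
    # Probability that two alleles drawn from different populations differ.
    den = p1 * (1 - p2) + p2 * (1 - p1)
    return num / den


# Identical sample frequencies: only the correction terms remain, so the estimate is negative.
same = hudson_fst_biallelic(0.5, 20, 0.5, 20)

# Strongly differentiated frequencies: the estimate is clearly positive.
diff = hudson_fst_biallelic(0.9, 20, 0.1, 20)

# Clipping to zero, as described above.
clipped = max(0.0, same)

print(same, diff, clipped)
```

A negative estimate here just means the observed difference is smaller than what sampling noise alone would be expected to produce, which is why treating it as zero is a common choice.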

My intuition for Fst is that it measures variance in allele frequencies between two populations. So in theory it shouldn't matter how many alleles are present at a locus. However, I haven't investigated how the different estimators behave in practice.
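A minimal sketch of that intuition for a single biallelic locus, assuming Wright's classical definition Fst = Var(p) / (p̄(1 - p̄)) with equally weighted populations and known (not estimated) frequencies. The function name `fst_from_frequencies` is hypothetical, and this is a definition-level illustration rather than any of the estimators in scikit-allel.

```python
def fst_from_frequencies(freqs):
    """Fst as the variance of an allele's frequency across populations,
    scaled by p_bar * (1 - p_bar), its maximum possible value.

    freqs: frequency of the same allele in each population.
    """
    n = len(freqs)
    p_bar = sum(freqs) / n
    # Population (not sample) variance of the allele frequency.
    var_p = sum((p - p_bar) ** 2 for p in freqs) / n
    return var_p / (p_bar * (1 - p_bar))


print(fst_from_frequencies([0.5, 0.5]))  # no difference between populations
print(fst_from_frequencies([0.9, 0.1]))  # strong difference
print(fst_from_frequencies([1.0, 0.0]))  # fixed difference
```

By this definition the value runs from 0 (identical frequencies) to 1 (fixed difference) and cannot be negative; negative values only appear once the frequencies are estimated from finite samples.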

XIA1112 commented 2 years ago

Hi, I am confused about the negative values of In (MH locus). Have you ever encountered negative values when calculating MH