emmanuelparadis / pegas

Population and Evolutionary Genetics Analysis System
GNU General Public License v2.0
27 stars 10 forks source link

Bug in FST calculation #49

Closed Ahhgust closed 3 years ago

Ahhgust commented 4 years ago

Hi! It appears that the Weir and Cockerham estimator of Fst has a bug in it. In particular, it appears that it is sensitive to the observed heterozygosity, when, if memory serves, the WC Fst is solely a function of the expected heterozygosity. Here is a simple reproducible example of two datasets wherein the allele frequencies are the same (within populations) but the estimated Fsts are different (in particular, the first population is different):

data.frame( MARKERS="Mito", INDIVIDUALS=1:20, population=c( rep("A", 10), rep("B", 10)), GT=c( paste( rep("1", 10), rep("2", 10), sep="/"), paste( rep("3", 10), rep("4", 10), sep="/") ), stringsAsFactors =TRUE ) -> dfallhet

data.frame( MARKERS="Mito", INDIVIDUALS=1:20, population=c( rep("A", 10), rep("B", 10)), GT=c( rep("1/1", 5), rep("2/2", 5), paste( rep("3", 10), rep("4", 10), sep="/") ), stringsAsFactors =TRUE ) -> dfallhom

tib<-pegas::as.loci(dfallhet, col.pop = 3, col.loci = "GT") # convert the df to an object of class loci (see package pegas) summary(tib)

tib2<-pegas::as.loci(dfallhom, col.pop = 3, col.loci = "GT") # convert the df to an object of class loci (see package pegas) summary(tib2) # allele frequencies (pooled) are the same between tib and tib2

pegas::Fst(tib, quiet=FALSE) pegas::Fst(tib2, quiet=FALSE)

To be fair, the bias appears worse when the sample size is small, so perhaps it's a corner-case that's not quite right... Regardless, thank you for looking into this! -August

emmanuelparadis commented 4 years ago

Hi,

Fst() in pegas uses Weir--Cockerham's formulas which are based on the variances of alleles. A user wrote to me recently about the same observation, so I'm posting his (adapted) example for the record. The data are simpler than yours:

X <- data.frame(L1 = factor(c("3/3", "1/1", "4/4", "4/4")),
                L2 = factor(c("3/1", "3/1", "4/4", "4/4")),
                population = factor(c("P1", "P1", "P2", "P2")))
X <- as.loci(X)

So there are two populations (P1 and P2) each with two individuals and two loci (L1 and L2). You expect the F_ST to be the same for both loci because the differentiation between the two populations is the same:

> Fst(X)
   Fit  Fst Fis
L1 1.0 0.50   1
L2 0.5 0.75  -1

The difference is because variances are poorly estimated with small sample sizes, so we increase them by simply replicating the rows:

> Fst(X[rep(1:4, 10), ])
   Fit       Fst Fis
L1 1.0 0.7368421   1
L2 0.5 0.7500000  -1
> Fst(X[rep(1:4, 100), ])
   Fit       Fst Fis
L1 1.0 0.7487437   1
L2 0.5 0.7500000  -1
> Fst(X[rep(1:4, 1000), ])
   Fit       Fst Fis
L1 1.0 0.7498749   1
L2 0.5 0.7500000  -1

Even a small number of replications make L1's F_ST converge to 0.75.

For completeness, heterozygosity-based estimators of F_ST (also called G_ST) are in the package mmod: the standard formula is also affected by sample sizes, while the corrected formula (G_ST') is not and is, in this case, equal to one (which makes sense since there is total differentiation between the two populations).

Cheers,

Emmanuel

Ahhgust commented 4 years ago

Thanks Emmanuel! I think you're been a little too kind in your response, but I thank your for it regardless. Because of your comments I went back and did my due-diligence; I re-read the WC-84, and placed it in its historical context (in part at least). After re-reading the original paper I see two different ways to estimate Fst (the estimator θ ) per WC-84. The first uses the observed heterozygosity (h-bar, from formulas 2,3,4 in WC-84), while the second uses the (information content of the) expected heterozygosity (i.e., p*(1-p) ) (though the term "expected heterozygosity" is admittedly wanting). The second formulation is more clearly presented in other works (e.g., Bhatia et al. 2013), but it is also only presented in terms of biallelic sites (which I suppose gets points for historical consistency).

Reading through your code I see that you are using the θu estimator, which in turn relates to the observed heterozygosity. Further, it tells me that my original supposition is, well, wrong. The multiallelic θ does in fact use the observed heterozygosity (!) in the original presentation, a fact that I was quite ignorant of! The relationship between the observed heterozygosity and what Fst represents seems... tenuous to me (for one, allele-correlations within vs between populations is well defined when the ploidy is not 2, unlike in this presentation), but I suppose that's an aside. I would also wager that may be why the p(1-p) version is more commonly seen (at least in the human genetics world), and was the reason for my original comment.

Again, thank you for your response. -August