ANGSD / angsd

Program for analysing NGS data.

Fst value strongly affected by very low sample size, even with Bhatia -whichFst 1 option #514

Open TeresaPegan opened 2 years ago

TeresaPegan commented 2 years ago

Hi, I frequently use ANGSD for Fst calculations, and I recently ran one on a subset of my data with 6 populations, where two of the 6 populations happen to have only 1 individual each. (The other 4 populations each have around 8-15 individuals.) Even with -whichFst 1, which uses the Bhatia estimator that is supposed to improve estimation when population sizes are very imbalanced, the Fst values involving the single-sample populations are very different from all of the others. This is illustrated in the attached plot of Fst/(1-Fst) vs geographic distance. The single-sample populations are QC and NMB on the plot.

You can see that there are basically 3 levels of Fst values. The upper range of Fst estimates involves one of the single-sample populations (QC or NMB) compared with one of the other multisample populations. The second group is comparisons between two multisample populations: these have much lower Fst estimates than comparisons involving one of the single-sample populations. Finally, when the two single-sample populations are compared, they are given a negative Fst value. It seems likely that all of the Fst values involving QC or NMB are highly biased in some way that makes them not biologically interpretable.

Evidently I should not use populations with single samples in them. I thought it would be worth pointing this out because it raises some questions about when the Bhatia estimator actually works well with ANGSD data. The estimator apparently alleviates problems that arise from sample size imbalance, but only to a certain extent, since it can't deal with single-sample populations. What level of sample imbalance is acceptable? How can I be sure that sample size imbalance is not affecting all of my Fst comparisons, just less visibly? Can I trust the Fst values calculated between multisample populations?
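For reference, the Hudson estimator that Bhatia et al. recommend can be sketched in a few lines of Python. This is only an illustration of the published formula applied to sample allele frequencies, not ANGSD's implementation (ANGSD estimates Fst from sample allele frequency likelihoods and a 2D SFS prior rather than from called frequencies). The function name `hudson_fst` and the convention that `n1` and `n2` count sampled allele copies (2 per diploid individual) are assumptions made for the example.

```python
def hudson_fst(p1, n1, p2, n2):
    """Ratio-of-averages Hudson Fst over a set of sites.

    p1, p2: per-site sample allele frequencies in each population.
    n1, n2: number of sampled allele copies per population
            (assumed here to be 2 per diploid individual).
    """
    if n1 < 2 or n2 < 2:
        raise ValueError("each population needs at least 2 allele copies")
    num_sum = 0.0
    den_sum = 0.0
    for a, b in zip(p1, p2):
        # Squared frequency difference, minus a finite-sample
        # correction for each population.
        num_sum += (a - b) ** 2 - a * (1 - a) / (n1 - 1) - b * (1 - b) / (n2 - 1)
        # Probability that one allele drawn from each population differs.
        den_sum += a * (1 - b) + b * (1 - a)
    if den_sum == 0:
        raise ValueError("no between-population variation at these sites")
    # Sum numerators and denominators separately, then divide
    # (ratio of averages rather than average of per-site ratios).
    return num_sum / den_sum


# Hypothetical inputs: fixed difference, 10 diploids per population.
print(hudson_fst([1.0], 20, [0.0], 20))
# Hypothetical inputs: two single diploids, both heterozygous at the site.
print(hudson_fst([0.5], 2, [0.5], 2))
```

With a single diploid individual, n = 2, so the correction term p(1-p)/(n-1) is as large as it can be, and the sample frequency at each site can only be 0, 0.5 or 1. That may help explain why estimates involving single-sample populations are so unstable, and why comparing two single samples can give a negative value.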

I appreciate any insights people have! :) Thanks, -Teresa

[Attached image: plot of Fst/(1-Fst) vs geographic distance]

TonyKess commented 2 years ago

I've run into variable FST results, including lots of negative values, when my sample size/coverage combo was low (e.g. 10 inds, 1.5x coverage/ind) when using the SFS methods. Interestingly, PCANGSD and poolseq FST estimates were not impacted. Something similar is probably at play here. There's evidence of fairly accurate differentiation estimates from low (n ~ 5) samples when lots of SNPs are covered, but I would avoid using FST, which is a group allele-frequency-based estimate, for single samples. Maybe a method like ohana would work better?

TeresaPegan commented 2 years ago

Thank you for the helpful insights! I took a look at ohana, but I didn't see methods in that package that seem designed specifically to look at population connectivity across space (which is the reason why I've been looking into pairwise Fst: to look at isolation-by-distance slopes, which I am comparing across many species). I'm less interested in things like admixture proportions, which seem to be ohana's strong suit. Let me know if there is an Fst method associated with ohana that I missed. For my purposes, perhaps I should simply restrict my Fst analyses to populations where I have at least several individuals. Thanks, -Teresa
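As a rough illustration of the isolation-by-distance slope mentioned above, the sketch below linearizes pairwise Fst as Fst/(1-Fst) and fits an ordinary least-squares slope against geographic distance. The function names `linearized_fst` and `ibd_slope` and all numbers are hypothetical, chosen only to show the arithmetic; they are not values from this dataset.

```python
def linearized_fst(fst):
    """Linearized differentiation, Fst / (1 - Fst)."""
    return fst / (1.0 - fst)


def ibd_slope(fst_values, distances):
    """Least-squares slope of Fst/(1-Fst) on geographic distance."""
    y = [linearized_fst(f) for f in fst_values]
    x = list(distances)
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope = covariance(x, y) / variance(x).
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


# Hypothetical pairwise comparisons: distance (km) and Fst per pair.
example_distances = [100.0, 200.0, 300.0]
example_fst = [0.1 / 1.1, 0.2 / 1.2, 0.3 / 1.3]
print(ibd_slope(example_fst, example_distances))
```

Because pairwise comparisons share populations, the points are not independent, so a slope fitted this way is a descriptive summary; significance is usually assessed with a permutation approach such as a Mantel test rather than with standard regression p-values.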

TonyKess commented 1 year ago

Hi again, a couple more thoughts: you could run PCANGSD and use individuals' PC scores as measures of genetic distance, provided there is enough IBD that the PCs identified match spatial variation. You could also use the PCA to bin your single individuals into larger groups and then compare FST values between those groups. Another option may be distAngsd. Hopefully something here helps!
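A minimal sketch of the PC-score idea, assuming the scores have already been obtained (for example from an eigendecomposition of the covariance matrix that PCANGSD estimates): genetic distance between two individuals is taken as the Euclidean distance between their score vectors. The function name `pairwise_pc_distance` and the example scores are hypothetical.

```python
import math


def pairwise_pc_distance(pc_scores):
    """Euclidean distance between every pair of individuals.

    pc_scores: one row per individual, one column per retained PC.
    Returns a symmetric matrix (list of lists) of distances.
    """
    n = len(pc_scores)
    return [
        [math.dist(pc_scores[i], pc_scores[j]) for j in range(n)]
        for i in range(n)
    ]


# Hypothetical scores on the first two PCs for three individuals.
example_scores = [[0.0, 0.0], [3.0, 4.0], [0.0, 0.0]]
example_dist = pairwise_pc_distance(example_scores)
print(example_dist[0][1])
```

The resulting matrix can be compared against a matrix of geographic distances in the same way as pairwise Fst, with the caveat that it measures distance between individuals rather than between populations.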