understanding pcadapt - Githubissues

peter-civan commented 2 years ago

Hello,

I am trying to identify candidate loci under selection in different cereal species, and I'm exploring several methods, including your PCAdapt.

Perhaps the most common problem with pcadapt is picking a 'good' number of PCs (in the case of a hierarchical population structure). I tried several values of K, and to my surprise, I noticed that the number of detected outliers increases as K increases.

The Luu et al. 2017 paper says that pcadapt 'assumes that candidate markers are outliers with respect to how they are related to population structure'. But it also says pcadapt assumes that 'markers excessively related to population structure are candidates for local adaptation', which in my understanding is not the same.

My understanding of the logic behind pcadapt is this: Let's consider 2 differentiated populations where each of them carries a SNP that is adaptive to drought. The frequency of this allele is 0.2 in each population (Fst=0). At low K, this SNP will be identified by pcadapt as an outlier, because it does not correlate well with the K PCs. If we increase K much higher, we may actually include a PC that is not related to population structure, but represents this particular SNP and other SNPs in LD with it. Our adaptive SNP will be correlated with that PC, and therefore not detected as an outlier. Consequently, we would expect fewer outliers with higher K values.

A colleague of mine has a different understanding. He thinks pcadapt identifies SNPs that contribute disproportionately to a given PC (i.e. more than expected). In the example above, we would not identify the drought-adaptive SNP at low K by pcadapt, because the SNP is not correlated at all to the top PCs. However, we might identify it at higher levels of K if a PC that corresponds to drought adaptation is picked, and that's why we detect more outliers with increasing K.

Which understanding (if any) is the correct one? And how would this work if the SNP is present in only one of those populations?

I'm also puzzled about the relationship between pcadapt outliers and Fst. Are they correlated? My expectation is that pcadapt would identify a subset of positions with high inter-population Fst, and that pcadapt can therefore be used to distinguish regions that have high Fst due to population structure from regions that have high Fst for other reasons (selection). But this is hard to reconcile with the example above.

Any advice will be greatly appreciated.

Peter

privefl commented 2 years ago

The statistic used in pcadapt summarizes the contribution to all PCs used. It is then important to use as many PCs that capture population structure as possible, but without capturing something else (e.g. LD).

The scree plot should give an initial idea of the number of PCs to choose. Then looking at the PC scores can further identify PCs that capture pop structure (versus just noise). Looking at PC loadings can identify PCs that capture LD (huge localized peaks). You might want to read https://doi.org/10.1093/bioinformatics/btaa520, which talks about this. Capturing a bit of noise in pcadapt should be fine, but capturing LD is not, as this will give you false positives.

I think pcadapt should give results similar to Fst when there are only two very distinct populations, and could be seen as more general as it can handle more populations, and more continuous ones (e.g. with admixture).
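
To make the Fst side of this comparison concrete for the two scenarios raised in this thread, here is a minimal Python sketch. It is not part of pcadapt: the function name `fst_two_pops` is hypothetical, and it assumes one biallelic SNP, two equally sized populations, and the simple heterozygosity-based definition Fst = (H_T - H_S) / H_T.

```python
def fst_two_pops(p1, p2):
    """Fst = (H_T - H_S) / H_T for one biallelic SNP.

    p1, p2: allele frequencies in two populations (assumed equally sized).
    """
    p_bar = (p1 + p2) / 2
    # Expected heterozygosity in the pooled population
    h_total = 2 * p_bar * (1 - p_bar)
    # Mean expected heterozygosity within populations
    h_within = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
    if h_total == 0:
        # SNP is monomorphic overall, so Fst is undefined
        return float("nan")
    return (h_total - h_within) / h_total


# Same frequency (0.2) in both populations
print(fst_two_pops(0.2, 0.2))
# Fixed in one population, absent from the other
print(fst_two_pops(1.0, 0.0))
```

Under these assumptions the equal-frequency SNP has Fst = 0 and the SNP fixed in one population has Fst = 1, so in a clean two-population setting only the second one is differentiated between populations.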

peter-civan commented 2 years ago

Many thanks for the quick response. I might have given the wrong impression in my previous post. I did not mean to ask how to choose the best K. Rather, I am asking about the logic and interpretation of pcadapt (since I was surprised that the number of outliers rises with K). If 'The statistic used in pcadapt summarizes the contribution to all PCs used', then how should an outlier be understood? Is it a SNP that contributes too little/nothing to the PCs used (see 'my understanding' above), or is it a SNP that contributes more to those PCs than is normal for most SNPs (see my colleague's understanding)? And particularly, would we detect an adaptive SNP in a dataset of two populations, if we use an appropriate K and (a) the variant is present in both populations at the same frequency? (b) the variant is present only in one population (where it may be fixed)?

Cheers

privefl commented 2 years ago

If more PCs capture more population structure, then it is normal that you detect more outliers.
Outliers are the ones contributing most to pop structure (the ones you're interested in).
You would detect (b).
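
A toy illustration of scenarios (a) and (b) from the question, as a rough Python sketch rather than pcadapt's actual code. It assumes K = 1 and an idealized PC1 that simply separates the two populations (scores of -1 and +1), and it uses the z-score of an ordinary linear regression of genotype on that score as a stand-in for how strongly a SNP is associated with population structure. The genotype vectors and the function name `regression_z` are made up for illustration; pcadapt itself combines z-scores across all K PCs into a single statistic.

```python
import math


def regression_z(genotypes, scores):
    """z-score (slope / standard error) from regressing genotype on a PC score."""
    n = len(genotypes)
    mean_g = sum(genotypes) / n
    mean_s = sum(scores) / n
    sxx = sum((s - mean_s) ** 2 for s in scores)
    sxy = sum((s - mean_s) * (g - mean_g) for s, g in zip(scores, genotypes))
    slope = sxy / sxx
    intercept = mean_g - slope * mean_s
    # Residual sum of squares of the fitted line
    rss = sum((g - intercept - slope * s) ** 2 for s, g in zip(scores, genotypes))
    if rss == 0:
        # Perfect fit: the standard error is zero
        return math.copysign(math.inf, slope) if slope != 0 else 0.0
    std_err = math.sqrt(rss / (n - 2) / sxx)
    return slope / std_err


# Idealized PC1 (assumption): 5 individuals per population, scored -1 and +1
pc1 = [-1.0] * 5 + [1.0] * 5

# (a) same allele frequency (0.2) in both populations
snp_a = [0, 0, 1, 1, 0] + [0, 0, 1, 1, 0]
# (b) nearly fixed in population 1, absent from population 2
snp_b = [2, 2, 2, 1, 2] + [0, 0, 0, 0, 0]

z_a = regression_z(snp_a, pc1)
z_b = regression_z(snp_b, pc1)
print("z for SNP (a):", z_a)
print("z for SNP (b):", z_b)
```

In this toy setup SNP (a) has a z-score of essentially zero, while SNP (b) has a z-score that is large in magnitude, which is consistent with (b) being the one detected.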

desa-la commented 1 year ago

Hi @privefl, I didn't want to open a new issue, since this is not really an issue but rather a clarification question related to this topic, to better understand pcadapt. I am having trouble understanding how we need to account for structure as a confounding effect in outlier studies, and here you write (and the pcadapt paper states the same) that outliers are the ones contributing most to pop structure (the ones you're interested in). Is it that pcadapt captures neutral structure, and then those that differ are considered outliers? Basically, I am asking how the potentially adaptive loci and the loci "responsible" for structure can be considered the same.

privefl commented 1 year ago

Indeed, pcadapt captures variants associated with population structure. However, "responsible" might be understood as causal, which they are probably not.

We call these variants "outliers" because of the statistical framework that is used to detect them.

bcm-uga / pcadapt

understanding pcadapt #74