What do you mean by "the best choice of K from the R package"? This choice is made by the user, based on looking at the screeplot and PC scores.
I guess it is rather the case K=1 that corresponds to Fst, because with K PCs you can separate K+1 populations.
pcadapt is really useful when K>1 (i.e. there are more than 2 populations). I'm not sure it is designed to look at each pair of populations, although you could do this by first filtering the bed/bim/fam files with e.g. PLINK and running each population pair separately (I think this was discussed in another issue here).
Thank you very much for your reply. I want to get the optimal K directly. This would be very useful for batch runs, where I pass this parameter straight to the next program. Can pcadapt output the optimal K value directly, rather than only graphically?
Thank you also for your reply about the selection signal. If I have only one group, does pcadapt no longer apply?
If I have more than 2 populations, how do I distinguish which group (or groups) is under selection?
If possible, could you give an example of detecting selection signals with multiple groups? That would be very helpful for understanding this software.
Estimating K programmatically can be tricky.
To apply pcadapt, you need population structure in your data (that will be captured via PCA). You can find applications in the 2017 paper.
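On the point about estimating K programmatically, here is a minimal sketch of one common heuristic: take the "elbow" of the scree curve as the point that falls furthest below the straight line joining the first and last points, and keep the PCs to its left (in the spirit of Cattell's rule). This is not a pcadapt feature. The function names `elbow_index` and `suggest_k` are hypothetical, the input is assumed to be the per-PC explained variance exported from R in decreasing order, and the numbers are made up for illustration. The result is best treated as a starting value to check against the PC scores, not as a replacement for looking at them.

```python
def elbow_index(explained):
    """Return the 1-based PC number where the scree curve bends.

    `explained` is a decreasing list of per-PC explained variance.
    The elbow is the point lying furthest below the chord that joins
    the first and the last point of the curve.
    """
    n = len(explained)
    if n < 3:
        raise ValueError("need at least 3 components to locate an elbow")
    first, last = explained[0], explained[-1]
    gaps = []
    for i, value in enumerate(explained):
        chord = first + (last - first) * i / (n - 1)  # chord height at PC i+1
        gaps.append(chord - value)                    # how far the curve dips below it
    return gaps.index(max(gaps)) + 1


def suggest_k(explained):
    """Keep the PCs to the left of the elbow (at least one)."""
    return max(1, elbow_index(explained) - 1)


# Made-up scree values: three informative PCs, then a flat tail.
scree = [0.30, 0.20, 0.12, 0.03, 0.028, 0.026, 0.025, 0.024]
print(suggest_k(scree))  # 3
```

With few individuals per population the scree curve can be noisy, so the heuristic may disagree with what the score plots suggest.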
Hi all. My question is somewhat similar, so I didn't open a new topic.
I'm running a RADseq analysis of a small rodent with samples from throughout its distribution. I have 16 populations and 59 individuals; after filtering I got ~4K SNPs. PCA (PC1 and PC2) shows that populations are geographically structured, and higher PCs also show structured populations. I ran the find.clusters (adegenet) and snmf (LEA) functions to explore clustering, getting K = 3, K = 16 (find.clusters) and K = 8 (snmf, which allows for admixed populations).
My question is about how many PCs to choose in the pcadapt function. Although my VCF file was filtered with MAF = 0.05, I used MAF = 0.1 in pcadapt to avoid detecting outliers in only one population (I have populations distributed in three macroregions).
I think that values of 4, 7, 13 and 16 are acceptable.
For K = 4, I got 429 / 258 outliers (threshold = 0.05; without adjustment / with the "holm" method)
For K = 7, I got 337 / 179 outliers (same parameters)
For K = 13, I got 443 / 267 outliers
For K = 16, I got 627 / 429 outliers
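To make the two numbers in each pair concrete, here is a small sketch of how the outlier count changes between raw p-values and Holm-adjusted p-values at the same 0.05 cutoff. It is a plain re-implementation of the Holm step-down adjustment, not code from pcadapt. The helper names `holm_adjust` and `count_outliers` are hypothetical and the p-values are made up. Note that Holm controls the family-wise error rate rather than the FDR, which is one reason the second count is always the smaller one.

```python
def holm_adjust(pvalues):
    """Holm step-down adjustment, returned in the original order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * n
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by n, the next by n - 1, and so on.
        candidate = min(1.0, (n - rank) * pvalues[idx])
        # Enforce monotonicity: an adjusted value never drops below an earlier one.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


def count_outliers(pvalues, alpha=0.05):
    """Return (count without adjustment, count after Holm) below alpha."""
    raw = sum(p < alpha for p in pvalues)
    holm = sum(p < alpha for p in holm_adjust(pvalues))
    return raw, holm


# Made-up p-values for six SNPs.
pvals = [0.001, 0.008, 0.02, 0.04, 0.30, 0.75]
print(count_outliers(pvals))  # (4, 2)
```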
Do you have any hints on choosing the value of K?
I also expect the distribution to follow a hierarchical stepping-stone model, so I may have drift effects on SNPs related to local, and I think that choosing a high value of K will increase confounding effects.
Finally, I also ran a BayeScan analysis, and I found SNPs common to both methods only for K = 13 and K = 16 (3 / 2 and 8 / 7, respectively, without correction / with the "holm" method).
Thanks in advance to anyone who can help. Regards, Luiz Eduardo
Choosing K is easy. From the scree plot, I would probably choose 15. This should be a good enough starting value to look at PC scores.
Then, you need to look at the PC score around PC15, to see which are
Stop when you do not capture pop structure anymore.
Thank you for your answer, Florian Privé, and sorry for the large pictures above.
I have investigated the PC scores, but the signal is confusing: from PC5 to PC10, populations start grouping around the center while still keeping related individuals together. From PC13 onwards, the signal is lost.
Anyway, thank you again for helping; I will think more about your answer.
Regards, Luiz Eduardo
That's right. It is harder to tell with just a few individuals in each population.
Then, based on the screeplot again, I would recommend using K = 6 if you want to play it safe, and K = 15 otherwise.
Thank you for your time and answers!
Hi, I have two questions about pcadapt.
The first: the scree plot can be used to determine the best choice of K. Can I get the optimal K value directly from the R package? I want to pass this parameter straight to the next program.
The second: if the optimal K = 2, the difference between the two groups can be calculated, which is similar to the FST approach.
But when the optimal K = 1, does that mean pcadapt can no longer be used? Can it not calculate the selection signal within a single population, as other statistics such as Pi, Hp, CLR, etc. do?
When the optimal K > 2, there are three or more groups, so how can I tell which group the detected selection signals belong to?
Thank you in advance.
Weiwei Fu