Questions about K values

weiwFu commented 3 years ago

Hi, I have two questions about pcadapt.

One is that the Scree plot can determine the best choice of K. Can I output the optimal K value directly with R package because I want to call this parameter directly to the next program.

Another problem is that if the optimal k = 2, the difference between the two groups can be calculated, which is similar to the algorithm of FST.

But when the optimal k = 1, does it mean that pcadapt will not be used any more? It can not calculate the selection signal of single population like other software, such as Pi, Hp, CLR, etc.

When optimal k > 2, it means that there are three or more groups, so how to distinguish the calculated selection signals belong to which group?

Thank you in advance.

Weiwei Fu

privefl commented 3 years ago

What do you mean by "the best choice of K from the R package"? This choice is made by the user, based on looking at the screeplot and PC scores.

I guess it is more the case K=1 that corresponds to Fst. Because with K PCs, you can separate K+1 populations.

pcadapt is really useful when K>1 (i.e. there are more than 2 populations) I'm not sure this is designed to look at each pair of population (although you could do this by first filtering the bed/bim/fam files with e.g. PLINK and running each population pairs separetely -> I think this was discussed in another issue here).

weiwFu commented 3 years ago

Thank you very much for your reply. I want to get the optimal k directly. This is very useful for me to do batch operations and put this parameter directly to the next program. Whether pcadapt can directly output the optimal K value to me, not in a graphic way?

Also thank you for your reply to the selection signal. If I have one group, pcadapt would no longer apply?

If I have more than 2 populations, how do I distinguish which group (or more than one) is under selection.

If possible, can you give an example of the application of multi group selection signal, which will be very important and helpful for me to understand this software.

privefl commented 3 years ago

Estimating K programmatically can be tricky.

To apply pcadapt, you need population structure in your data (that will be captured via PCA). You can find applications in the 2017 paper.

ribaslej commented 3 years ago

Hi all. My question is somewhat similar, so I didn't open a new topic.

I'm running RADseq analysis of a small rodent with samples throughout the distribution. I have 16 populations and 59 individuals, after previously filtering I got ~4K SNPs. PCA (PC1 and PC2) shows that populations are geographically structured, higher PCs also shows structured populations. I ran find.clusters (adegenet) and snmf (LEA) functions to explore clustering, getting K = 3, K = 16 (find.clusters) and K = 8 (snmf, which admits admixtured populations).

My question is about how many PCs to choose in pcadapt function. Although my VCF file was filtering used MAF = 0.05, I ran MAF = 0.1 in pcadapt to avoid detect outliers in only one population (I have populations distributed in three macroregions).

I think that values of 4, 7, 13 and 16 are acceptable. PCA screeplot

For K = 4, I got 429 / 258 outliers (FDR = 0.05, without adjust/"holm" method) qqplot K4 stats distrib K4 manhattan plot K4

For K=7, i got 337 / 179 outliers (same parameters) qqplot K7 stats distrib K7 manhattan plot K7

K = 13, I got 443 / 267 outliers stats distrib K13 manhattan plot K13 , qqplot K13

and for K = 16, I got 627 / 429 outliers qqplot K16 stat distrib K16 manhattan plot K16

Do you have any hints on choosing the value of K?

I also expect that distribution follow a hierarchical step in stones model, so I can have drift effect on SNPs related to local and I think that choosing a high value for K will increase confounding effects.

Finally, I also ran BayeScan analysis and only when considering K = 13 and K = 16 I found common SNPs in both methods (3 / 2 and 8 / 7, for K = 13 and K = 16 and without correction / "holm" method, respectively).

Thank you if anyone could help. Regards, Luiz Eduardo

privefl commented 3 years ago

Choosing K is easy. From the scree plot, I would probably choose 15. This should be a good enough starting value to look at PC scores.

Then, you need to look at the PC score around PC15, to see which are

capturing population structure
capturing artifacts such as LD
just noise (just a circle)

Stop when you do not capture pop structure anymore.

ribaslej commented 3 years ago

Thank you for your Florian Privé, and sorry for the large pictures above.

I have investigated PC scores but signal is confusing, since for PC5 until PC10 populations start to grouping around center but still keeping individuals related. From PC13 and higher, the signal is lost.

Any way thank you again for helping, I will think more about your answer.

Regards, Luiz Eduardo

PC5xPC6 PC7xPC8 PC9xPC10 PC11xPC12 PC13xPC14 PC15xPC16

privefl commented 3 years ago

That's right. It is harder to tell with just a few individuals in each population.

Then, based on the screeplot again, I would recommend to use 6 if you want to play it safe, and 15 otherwise.

ribaslej commented 3 years ago

Thank you for your time and answers!

bcm-uga / pcadapt

Questions about K values #63