Can I use PCAadapt even if I don't have population structure?

marce-herrera commented 2 months ago

Hello,

I have a dataset consisting of 40 individuals, with 20 individuals from each of two putative populations. Preliminary analyses suggest that these two populations are not significantly different from each other. However, I am interested in determining whether it is still possible to identify any adaptive loci despite the lack of population structure. Would it make sense to pursue this analysis, and if so, can this be achieved using PCAadapt?

Thank you!

privefl commented 2 months ago

What do you see on the first two PCs?

marce-herrera commented 2 months ago

Hi,

Thank you for getting back so fast!

As you can see below, there's no structure..

Screen Shot 2024-06-24 at 15 32 01

This is the screeplot: Screen Shot 2024-06-24 at 15 33 16

privefl commented 2 months ago

Then I guess you won't get much from pcadapt. Is the Manhattan plot flat?

marce-herrera commented 2 months ago

Hi there,

See the plots below for k = 10 Screen Shot 2024-06-24 at 16 18 07

Screen Shot 2024-06-24 at 16 39 59

Thanks!

privefl commented 2 months ago

The PCA here probably captures LD structure only; you need to use some clumping.

marce-herrera commented 2 months ago

Hello,

I did the LD clumping (as shown in the tutorial) and this is what I get:

Screen Shot 2024-06-24 at 17 44 40

This is for k=10 (same as before): Screen Shot 2024-06-24 at 17 44 22

I'm not really sure how to interpret this. Any advice?

Thank you!

privefl commented 2 months ago

I would interpret this as: you need even more clumping (smaller r2 threshold). If you still get something similar, I think you can give up.

marce-herrera commented 2 months ago

Hi,

So this time I used LD.clumping = list(size = 200, thr = 0.1)

Do you think doing size=500 or size=1000 will make a difference? by smaller r2 threshold you mean ...? thr = 0.05, thr = 0.01?

Thanks!

privefl commented 2 months ago

It seems you have very long-range LD regions. You can try e.g. size = 1e5 and thr = 0.02.

marce-herrera commented 2 months ago

Hello,

Here's an update:

After reviewing my progress, I realized that I had not performed LD pruning on my SNP dataset. I went back and completed this step using PLINK.

Afterward, I ran PCAdapt and obtained the following results:

x <- pcadapt(input = filename, K = 20) plot(x, option = "screeplot") Screen Shot 2024-06-25 at 16 12 14

poplist.names <- c(rep("lagoon", 20), rep("ocean", 20)) plot(x, option = "scores", pop = poplist.names) Screen Shot 2024-06-25 at 16 12 54

par(mfrow = c(2, 2)) for (i in 1:4) plot(x$loadings[, i], pch = 19, cex = .3, ylab = paste0("Loadings PC", i)) Screen Shot 2024-06-25 at 16 13 52

Based on the first screeplot I chose k = 5: x <- pcadapt(filename, K = 5) plot(x , option = "manhattan") Screen Shot 2024-06-25 at 16 14 44 (You can still see a peak, but it's much better than it was before I did the pruning.)

plot(x , option = "qqplot") Screen Shot 2024-06-25 at 16 15 42

hist(x$pvalues, xlab = "p-values", main = NULL, breaks = 50, col = "orange") Screen Shot 2024-06-25 at 16 16 14

plot(x, option = "stat.distribution") Screen Shot 2024-06-25 at 16 16 53

Because I still have high LD I did LD clumping as follows: newx <- pcadapt(filename, K = 20, LD.clumping = list(size = 200, thr = 0.1)) plot(newx, option = "screeplot") Screen Shot 2024-06-25 at 16 18 04

newx <- pcadapt(filename, K = 5, LD.clumping = list(size = 200, thr = 0.1))

par(mfrow = c(2, 2)) for (i in 1:4) plot(newx$loadings[, i], pch = 19, cex = .3, ylab = paste0("Loadings PC", i)) Screen Shot 2024-06-25 at 16 18 47

plot(newx , option = "manhattan") Screen Shot 2024-06-25 at 16 19 26

plot(newx , option = "qqplot") Screen Shot 2024-06-25 at 16 19 49

hist(newx$pvalues, xlab = "p-values", main = NULL, breaks = 50, col = "orange") Screen Shot 2024-06-25 at 16 20 25

plot(newx, option = "stat.distribution") Screen Shot 2024-06-25 at 16 20 35

I then get ~1200 outliers: padj <- p.adjust(newx$pvalues,method="BH") alpha <- 0.05 bh_outliers <- which(padj < alpha) length(bh_outliers) [1] 1216

Do you think this approach is correct? Am I right in understanding that these 1,216 outliers are adaptive loci? What does this imply even if I don't initially observe population structure? How can I retrieve these outliers? I am interested in analyzing them further with SNP effect to see the functional effect...

Thank you again!!

privefl commented 2 months ago

No, I'm afraid you're still capturing only LD. There is no visible population structure.

marce-herrera commented 2 months ago

Right, okay. Just to clarify, is PCAdapt only useful if there is population structure? If so, I cannot use it with this dataset... correct?

privefl commented 2 months ago

Yes x2

marce-herrera commented 2 months ago

Ok, thank you!

One last question (sorry if it seems simple, I'm new to population genetics). Can you recommend other tools for detecting outliers? Is it worth pursuing this if I don't have population structure data? My reasoning is that I might not have enough differences to detect population structure, but I could still find differences at certain loci, which I would be interested in exploring.

Thanks again!

privefl commented 2 months ago

I'm not really an expert in popgen either, sorry.

bcm-uga / pcadapt

Can I use PCAadapt even if I don't have population structure? #91