Low explained Variance after LD clumping

bcm-uga / pcadapt

Performing highly efficient genome scans for local adaptation with R package pcadapt v4

https://bcm-uga.github.io/pcadapt


Low explained Variance after LD clumping #87

Closed mehakmadhura closed 7 months ago

mehakmadhura commented 7 months ago

Hi, I have a question about LD clumping. In a PCA done with LD clumping, the percentage of variance explained by the PCs is much lower than in a PCA without clumping. Can you please explain the reason?

privefl commented 7 months ago

Could you give actual numbers and plots?

mehakmadhura commented 7 months ago

Hi! Thank you for your reply. Here are the scree plots with and without clumping.

[Image: screeplot_un_clump1000]

[Image: screeplot_un_ld_prunned]

The proportion of explained variance is lower in the first plot (with clumping).

privefl commented 7 months ago

I guess we expect a minor drop in variance explained, simply because fewer variables are used. However, I would not expect such a large drop, especially if population structure is captured.

What is the proportion of variables removed/kept after clumping?
Could you show the first two PC scores in both cases?

mehakmadhura commented 7 months ago

I think it keeps around 0.6 of the variants. Here are the PC scores with and without clumping.

[Image: scores12_un_k3_clump1000]

[Image: scores12_un_ld_prunned]

Also, when we decide on a K, for example K=3, does pcadapt perform the PCA again with K=3, or does it just select the first 3 PCs for the computation?

privefl commented 7 months ago

It keeps 60% of variants? Or 0.6%?
The PC scores seem highly similar, which is expected. So I don't get why there is such a large difference in the proportion of variance explained; it may simply be due to an error in the way it is computed.
pcadapt does the computation from scratch every time, but it should be quite fast.
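
For readers who want to reproduce this comparison, here is a minimal sketch. It assumes the pcadapt package is installed and uses the small example file shipped with it; the clumping window size and threshold are the values from the package tutorial and are not a recommendation for any particular dataset.

```r
library(pcadapt)

# Example data shipped with the package; replace with your own file.
path <- system.file("extdata", "geno3pops.bed", package = "pcadapt")
geno <- read.pcadapt(path, type = "bed")

# Each call runs the PCA again for the requested K.
x_k10 <- pcadapt(geno, K = 10)
x_k3  <- pcadapt(geno, K = 3)

# Same K with LD clumping (window size and r^2 threshold as in the tutorial).
x_k3_clump <- pcadapt(geno, K = 3, LD.clumping = list(size = 500, thr = 0.1))

# singular.values are documented as the square roots of the proportion of
# variance explained by each PC, so squaring them gives the scree plot heights.
x_k3$singular.values^2
x_k3_clump$singular.values^2

# Scree plot for the clumped run.
plot(x_k3_clump, option = "screeplot")
```

Comparing the two vectors of squared singular values is a quick way to see the numbers behind the scree plots rather than reading them off the figures.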

mehakmadhura commented 7 months ago

Thank you for your replies. It keeps 60% of the variants. Also, does it show such a decrease in the variance explained for other datasets as well?

privefl commented 7 months ago

These are the results from the tutorial:

[Image: unnamed-chunk-17-1]

[Image: unnamed-chunk-20-1]

So, I guess, yes.

I'll try to see if there is a better way to estimate these.

mehakmadhura commented 7 months ago

Thank you so much!

privefl commented 7 months ago

I've implemented a better estimate of the total variance to get better estimates of the proportions of explained variance. But I still get very similar results for the tutorial.

Can you try the latest GitHub version on your data?

mehakmadhura commented 7 months ago

Hi! Thanks for the update. I tried the new version on my dataset. I have attached the scree plots with and without clumping.

[Image: test_french_clump1000_new4]

[Image: test_french_noclump_new4]

privefl commented 7 months ago

What do you get for sum(!is.na(x$loadings[, 1])) / length(x$pass)?

mehakmadhura commented 7 months ago

0.03581412 with clumping and 1.050497 without clumping. Also, this is with K=10. I haven't chosen a K yet.

privefl commented 7 months ago

Okay, this should be the proportion of variants (among those that passed the MAF threshold) kept after clumping. So only a very small share (about 3.6%) is kept. This may explain the results. However, I would have expected you to get 1 without clumping.
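
To make the diagnostic above concrete, here is a small sketch on a mock object rather than a real pcadapt result. It assumes, as the reply implies, that `loadings` has one row per variant with `NA` for variants not used in the PCA, and that `pass` lists the variants that passed the MAF filter; all counts are hypothetical.

```r
# Toy stand-in for a pcadapt result; the real object comes from pcadapt().
n_variants <- 1000
pass <- 1:800                     # hypothetical: 800 variants pass the MAF filter
kept <- seq(1, 800, by = 20)      # hypothetical: 40 of them survive clumping

# Loadings are NA for every variant that was not used in the PCA.
loadings <- matrix(NA_real_, nrow = n_variants, ncol = 2)
loadings[kept, ] <- 0.1

x <- list(loadings = loadings, pass = pass)

# Proportion of MAF-passing variants that were actually used in the PCA.
prop_kept <- sum(!is.na(x$loadings[, 1])) / length(x$pass)
prop_kept  # 0.05 in this toy example (40 / 800)
```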

mehakmadhura commented 7 months ago

Oh! So is the fraction of variance explained by the PCs computed relative to the variance of the whole data, and not just the subset left after clumping?

privefl commented 7 months ago

Yes, the total variance is computed from the data after the MAF threshold.
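
The effect of the denominator can be illustrated with a small base-R simulation. This is not pcadapt's internal code, and the data are simulated rather than real genotypes. It only shows that when the PCA is run on a retained subset of variables while the total variance is taken over all variables, the reported proportion shrinks by the kept fraction.

```r
set.seed(1)

# Simulated data: 100 samples, 2000 variables with a shared two-group structure.
n <- 100
p <- 2000
group <- rep(c(-1, 1), each = n / 2)
G <- outer(group, rnorm(p, sd = 0.5)) + matrix(rnorm(n * p), n, p)
G <- scale(G)                     # centre and scale each column

keep <- sample(p, size = 100)     # hypothetical: 5% of variables retained
sv <- svd(G[, keep], nu = 0, nv = 0)$d

# Same numerator (first PC of the subset), two different denominators.
pve_subset <- sv[1]^2 / sum(G[, keep]^2)  # relative to the retained subset
pve_all    <- sv[1]^2 / sum(G^2)          # relative to all variables

c(subset = pve_subset, all = pve_all)
pve_all / pve_subset              # equals the kept fraction, 100 / 2000
```

Because every scaled column contributes the same variance here, the ratio of the two proportions is exactly the fraction of variables kept; with real genotype data the relationship would only be approximate.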

mehakmadhura commented 7 months ago

Okay. Thank you so much for your prompt replies!

privefl commented 7 months ago

Are you happy with this? Should we close this issue?

mehakmadhura commented 7 months ago

Yes, sure! Thank you so much.