bcm-uga / pcadapt

Performing highly efficient genome scans for local adaptation with R package pcadapt v4
https://bcm-uga.github.io/pcadapt

How to obtain details of PCA drawn by pcadapt #52

Closed by Mary-00 4 years ago

Mary-00 commented 4 years ago

Dear authors and all, I have drawn a PCA using the pcadapt package to show the population structure based on a list of SNPs. However, a reviewer asked that the details of the PCA be presented as a supplementary file. Could you please help me work out how to obtain such details?

Many thanks in advance

privefl commented 4 years ago

What details are you referring to? The procedure? The algorithm? The results (scree/PC plots)?

Mary-00 commented 4 years ago

I'm not sure which kind of details the reviewer means, but I think it's related to the score plot, something like loadings and scores, isn't it?

privefl commented 4 years ago

You have some details about the method in the new short paper: https://doi.org/10.1093/molbev/msaa053. You can always provide the scree plot and some score plots. There are functions to create these figures in the package.

Mary-00 commented 4 years ago

Dear Florian, thank you. However, I’m a bit confused. I used the commands below to draw the score plot with pcadapt:

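# Convert the VCF to pcadapt format, then read it in
vcf2pcadapt("file1.vcf", output = "file2.pcadapt", allele.sep = c("/", "|"))
geno_pcadapt <- read.pcadapt("file2.pcadapt", type = "pcadapt")

# PCA without clumping; the scree plot helps choose K
x <- pcadapt(input = geno_pcadapt, K = 5)
plot(x, option = "screeplot")

# PCA with LD clumping, then the score plot
par(mfrow = c(1, 2))
res <- pcadapt(geno_pcadapt, K = 5, LD.clumping = list(size = 200, thr = 0.1))
plot(res, option = "scores")

# Singular values from the run without clumping
sing.value <- x$singular.values
sing.value
# 0.3269 0.2674 0.2456 0.2244 0.2103

# Loadings from the run with clumping
Loadings <- res$loadings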

Here, “singular.values”, as shown in the scree plot, is the percentage of variance explained by each PC, yes?

Regarding the loadings from the run used for the score plot, there were correlation values between the markers and each PC for only 220 of the 2479 markers; the remaining 2259 markers had NA. Could you please let me know why most of the markers (2259) have NA? Were they removed by LD.clumping?

Finally, since the reviewer requested details, I plan to present the loadings in the supplementary file. Is that OK in your opinion? Please share any suggestions on what to include as details of the PCA plot.

Thank you very much in advance

privefl commented 4 years ago

The proportion of variance explained is approx sing.value^2 (cf. https://github.com/bcm-uga/pcadapt/issues/46).
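As a quick check of that relationship, here is a minimal sketch in base R using the five singular values quoted earlier in the thread. The names `prop_var` and `pve` are only illustrative, and the result is "approximately" the proportion of variance in the sense stated above, not an exact identity.

```r
# Singular values as reported in the thread (x$singular.values)
sing.value <- c(0.3269, 0.2674, 0.2456, 0.2244, 0.2103)

# Approximate proportion of variance explained by each PC
prop_var <- sing.value^2

# Small table that could accompany a scree plot
# (column names are illustrative, not pcadapt output)
pve <- data.frame(
  PC = seq_along(sing.value),
  singular_value = sing.value,
  prop_variance = round(prop_var, 4)
)
print(pve)
```

With a real result, the first line would simply be `sing.value <- x$singular.values`.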

Is it normal that you have so few variants? We have talked about this already, right?

The NAs probably come from filtering on MAF. Clumping should not affect which variants are reported.

Mary-00 commented 4 years ago

Thank you very much. Right, we have talked about the outlier detection and the small number of variants before. Here, I'm talking only about the score plot obtained with pcadapt. I checked the MAF of the variants: only 185 variants had MAF < 0.05, so the other variants, with MAF > 0.05, should be retained. However, 2259 variants have NA in the correlation matrix (loadings), so what happened?

privefl commented 4 years ago

Sorry, I was not talking about the same thing as you were.

Loadings have NAs for the variants removed by clumping (loadings are basically reported only for the variants used in the PCA). But in the matrix of Z-scores, there should be NAs only for the low-MAF variants.
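One way to see this distinction is to count the all-NA rows in each matrix. The sketch below uses a small hypothetical stand-in, `res_toy`, instead of a real pcadapt result; it assumes, for illustration only, that a low-MAF variant is NA in both matrices, while a variant removed by clumping is NA only in the loadings.

```r
# Hypothetical stand-in for a pcadapt result: 6 variants, K = 2.
# Assumption: variants 2 and 5 were removed by clumping,
# and variant 6 was filtered out for low MAF.
res_toy <- list(
  loadings = matrix(c(0.4, NA, 0.1, -0.3, NA, NA,
                      0.2, NA, -0.5, 0.1, NA, NA), ncol = 2),
  zscores  = matrix(c(1.2, 0.8, -0.4, 2.1, -1.0, NA,
                      0.3, -0.6, 1.5, 0.2, 0.9, NA), ncol = 2)
)

# TRUE for variants that are NA on every PC
na_loadings <- rowSums(is.na(res_toy$loadings)) == ncol(res_toy$loadings)
na_zscores  <- rowSums(is.na(res_toy$zscores))  == ncol(res_toy$zscores)

na_summary <- c(
  n_variants          = nrow(res_toy$loadings),
  na_in_loadings      = sum(na_loadings),
  na_in_zscores       = sum(na_zscores),
  na_in_loadings_only = sum(na_loadings & !na_zscores)
)
print(na_summary)
```

On a real object, the same counts could be taken from `res$loadings` and `res$zscores`; the `na_in_loadings_only` entry would then correspond to variants dropped by clumping alone.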

Mary-00 commented 4 years ago

Many thanks for the clarification. Sorry, just one more thing: in your opinion, is this file (the loadings) enough as details of the PCA for the supplementary material? Please let me know if you have any suggestions.

Thanks

privefl commented 4 years ago

As said before, I would probably show the scree plot (to justify the choice of K) and the PC scores (to show that you're capturing some population structure).

But you should probably ask your supervisor about this.
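If the PC scores are also wanted as a table alongside the plots, a minimal sketch of exporting them with base R could look like this. The matrix `scores` is a hypothetical stand-in for `res$scores`, and the sample labels and file name are placeholders.

```r
# Hypothetical stand-in for res$scores: 4 individuals, K = 2
scores <- matrix(c(0.51, -0.48, 0.02, -0.05,
                   0.10, 0.12, -0.61, 0.39), ncol = 2)

# One column per PC, named PC1, PC2, ...
scores_df <- as.data.frame(scores)
names(scores_df) <- paste0("PC", seq_len(ncol(scores)))

# Placeholder sample labels; replace with the real individual IDs
scores_df <- data.frame(
  sample = paste0("ind", seq_len(nrow(scores))),
  scores_df
)

# Write to a temporary location; choose a real path for actual use
out_file <- file.path(tempdir(), "pca_scores.csv")
write.csv(scores_df, out_file, row.names = FALSE)
```

The scree plot and score plots themselves come from `plot(x, option = "screeplot")` and `plot(res, option = "scores")`, as used earlier in the thread.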

Mary-00 commented 4 years ago

Many thanks for your kind responses.