Closed YannDussert closed 4 years ago
In the code, we have https://github.com/bcm-uga/pcadapt/blob/307c7e218e77a55427dd61353ede4a26c9e2c90a/R/pcadapt.R#L246, where (nrow(obj.pca$u) - 1) * length(obj.pca$pass) should approximate the total variance of the scaled matrix.
Recall that only the first PCs are computed, so you can't sum over only the first values to get the total variance.
Does the result of cumsum(singular.values[i]^2)
look weird to you?
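To illustrate the point about truncation, here is a minimal base-R sketch on simulated data. It does not call pcadapt; the matrix sizes, the seed, and all variable names are made up for illustration. It only shows that squared singular values add up to the total variance when every component is included, so normalising by the first K alone overstates each PC's share.

```r
set.seed(1)
n <- 50; p <- 200; K <- 5                 # hypothetical sizes, for illustration only
X <- scale(matrix(rnorm(n * p), n, p))    # columns centered, sample variance 1

d <- svd(X)$d                             # all singular values (full SVD)
total_var <- (n - 1) * p                  # sum of squares of a column-scaled matrix

prop_all   <- d^2 / total_var             # proportion of variance per PC
prop_wrong <- d[1:K]^2 / sum(d[1:K]^2)    # normalising by the first K PCs only

sum(prop_all)        # 1, up to floating point error
sum(prop_all[1:K])   # well below 1: the first K PCs explain only part of the variance
sum(prop_wrong)      # 1 by construction, whatever K is
```

With a truncated SVD only d[1:K] is available, which is why the denominator has to come from the matrix dimensions (or another estimate of the total variance) rather than from the singular values themselves.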
Thank you for your reply.
I was surprised by the fact that the total proportion of explained variance was not 1, but after your reply, I realized that the function uses a truncated SVD, so, as you said, only the first K PCs are taken into account. Am I correct?
Best, Yann
Yes, only the first K PCs are used. In applications such as human genetics, the first 10 PCs sometimes explain only 10% of the total variance. Are you seeing something like this?
Hi,
I am having issues because our total explained variance is well above 1 (1.8 using singular values over the first 20 axes). Have you seen this before?
Do you know why it happens? I can't get my head around it.
Thank you very much for your help!
Ludo
Did you forget to square the singular values?
Thanks!
No, I did not. From the scree plot, the first few axes alone sum to over 1, and this is confirmed by the singular values:
data <- read.pcadapt("myfile.vcf", type = "vcf")
x <- pcadapt(input = data, K = 20)
## calculate EV
EV <- x$singular.values^2
EV2 <- EV * 100
EV2
## EV2 labels: "PC1, 39.2%", "PC2, 25.7%", "PC3, 21.5%", "PC4, 20.2%"
Very much appreciate the help (and the package).
Ludo
Weird indeed. What is the size of your data? Are you allowed to share it?
Hi Florian,
it is only 52k SNPs. I sent it by email to **.21@gmail.com. Hopefully it is the right address!
Ludo
library(pcadapt)
data <- read.pcadapt("tmp-data/out.recode.vcf", type = "vcf")
x <- pcadapt(input = data, K = 20)
X <- bed2matrix(data)
table(X[, x$pass], exclude = NULL)
# 0 1 2 <NA>
# 1789596 35334 327475 140215
X.scaled <- scale(X, center = 2 * x$af, scale = sqrt(2 * x$af * (1 - x$af)))
colMeans(X.scaled[, x$pass]^2, na.rm = TRUE) # should be all ~1, but are ~2
It seems that your genotypes do not follow HWE at all (too few 1s), so the total variance is approximately twice what the package approximates.
That makes sense.
Is the HWE assumption for the imputation or within the pca?
Would it be correct to take the total estimated variance over, say, the first 100 axes as the "total explained variance" and then express each of those axes as a proportion of that variance, basically rescaling it?
For the scaling used in the PCA (sqrt(2p(1-p))), it is supposed to give you variables that have variance 1 under HWE.
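A small simulation can make this concrete. The sketch below uses base R only and simulated genotypes, not the data from this thread; the sample size, allele frequencies, and the helper hwe_scale() are assumptions introduced for illustration. It compares genotypes drawn under HWE with fully homozygous genotypes (no 1s at all), using the same centering and scaling as above.

```r
set.seed(42)
n <- 5000; p <- 20                        # hypothetical sample and SNP counts
af <- runif(p, 0.2, 0.8)                  # allele frequencies, assumed known here

# Under HWE: genotype = number of alternate alleles ~ Binomial(2, af)
G_hwe <- sapply(af, function(f) rbinom(n, 2, f))
# No heterozygotes at all: genotypes are 0 or 2, same allele frequencies
G_inb <- sapply(af, function(f) 2 * rbinom(n, 1, f))

# Hypothetical helper: center by 2p and scale by sqrt(2p(1-p))
hwe_scale <- function(G, af) {
  centered <- sweep(G, 2, 2 * af, "-")
  sweep(centered, 2, sqrt(2 * af * (1 - af)), "/")
}

v_hwe <- colMeans(hwe_scale(G_hwe, af)^2)  # close to 1 for every SNP
v_inb <- colMeans(hwe_scale(G_inb, af)^2)  # close to 2 for every SNP
round(range(v_hwe), 2)
round(range(v_inb), 2)
```

Without heterozygotes the genotype variance is 4p(1-p) instead of 2p(1-p), which would explain a factor of about 2 like the one seen with colMeans() above.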
No, I don't think it would be correct. But usually, the singular values (or their squares) decrease almost linearly, so you can extrapolate the rest of them and therefore the total variance.
Also, the total variance is just the sum of squared elements of the (scaled) matrix.
So, basically, I think you could just use sum(colMeans(X.scaled[, x$pass]^2, na.rm = TRUE)) * nrow(X) as the total variance.
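As a sanity check of that idea, here is a base-R sketch on simulated genotypes (sizes, seed, and names are hypothetical, and pcadapt is not used). It takes the sum over SNPs so that the result is a single number, which is my reading of the suggestion, and compares it with the sum of squared entries and with the squared singular values of a full SVD.

```r
set.seed(7)
n <- 40; p <- 30                            # hypothetical sizes
G  <- matrix(rbinom(n * p, 2, 0.3), n, p)   # simulated genotypes (0/1/2)
af <- colMeans(G) / 2                       # estimated allele frequencies
Xs <- scale(G, center = 2 * af, scale = sqrt(2 * af * (1 - af)))

# Per-SNP mean squares, summed over SNPs and multiplied by the number of rows:
# one number, the sum of all squared entries of the scaled matrix
total_var <- sum(colMeans(Xs^2, na.rm = TRUE)) * nrow(Xs)

all.equal(total_var, sum(Xs^2))            # TRUE when nothing is missing
all.equal(total_var, sum(svd(Xs)$d^2))     # TRUE: squared singular values sum to it

# With missing genotypes sum(Xs^2) is NA, but the colMeans form still gives a value
Xs_na <- Xs
Xs_na[sample(length(Xs_na), 50)] <- NA
sum(colMeans(Xs_na^2, na.rm = TRUE)) * nrow(Xs_na)
```

The na.rm = TRUE version amounts to treating each missing entry as if it contributed that SNP's average squared value, so it is an approximation once genotypes are missing.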
Ok, I think that should work. You can extrapolate the variance explained like this (using K = 100 as input):
y <- cumsum(x$singular.values^2)
plot(y, log = "")
y2 <- splinefun(seq_along(y), y, method = "monoH.FC")(101:(nrow(x$scores) - 1))
plot(c(y, y2))
EV <- x$singular.values^2 / max(y2)
Thank you so much, very grateful for your help.
That is awesome!
Interestingly, simply scaling by the summed variance of the computed axes gives relatively similar results.
See below for the first five axes:
(x$singular.values**2/sum(x$singular.values**2))[1:5]
#0.19307577 0.12665384 0.10608894 0.09965733 0.07435843
EV[1:5]
#0.18669681 0.12246936 0.10258390 0.09636479 0.07190173
Thank you.
Ludo
There is now a better estimate of the total variance used in v4.4.
Hi,
In the pcadapt documentation, the singular.values vector in a pcadapt object is described as "the vector containing the K ordered squared root of the proportion of variance explained by each PC". The values for the y-axis on the scree plot are also computed by "squaring" the singular values.
However, should the proportion of explained variance for the i-th PC not rather be computed as singular.values[i]^2 / sum(singular.values^2)?
I am sorry if I missed something and if I am mistaken.
Best regards, Yann