Singular values & proportion of explained variance

bcm-uga / pcadapt

Performing highly efficient genome scans for local adaptation with R package pcadapt v4

https://bcm-uga.github.io/pcadapt

39 stars 10 forks source link

Singular values & proportion of explained variance #46

Closed YannDussert closed 4 years ago

YannDussert commented 4 years ago

Hi,

In the pcadapt documentation, the singular.values vector in a pcadapt object is described as "the vector containing the K ordered squared root of the proportion of variance explained by each PC". The values for the y-axis on the scree plot are also computed by "squaring" the singular values.

However, should the proportion of explained variance for the i-th PC not rather be computed as:

singular.values[i]^2 / sum(singular.values^2)

I am sorry if I missed something and if I am mistaken.

Best regards, Yann

privefl commented 4 years ago

In the code, we have https://github.com/bcm-uga/pcadapt/blob/307c7e218e77a55427dd61353ede4a26c9e2c90a/R/pcadapt.R#L246

where (nrow(obj.pca$u) - 1) * length(obj.pca$pass) should approximate the total variance of the scaled matrix.

Recall that only the first PCs are computed, so you can't sum over only the first values to get the total variance.

Does the result of cumsum(singular.values[i]^2) look weird to you?

YannDussert commented 4 years ago

Thank you for you reply.

I was surprised by the fact that the total proportion of explained variance was not 1, but after your reply, I realized that the function uses a truncated SVD, so, as you said, only the first K PCs are taken into account. Am I correct?

Best, Yann

privefl commented 4 years ago

Yes, only the first K PCs are used. For applications such as in human genetics, the first 10 PCs can explain only 10% of the total variance sometimes. Are you seeing something like this?

ldutoit commented 2 years ago

Hi,

I am having issues because our total explained variance is well above 1 (1.8 using singular values over the first 20 axes). Have you seen this before?

thumbnail_scree plot - K = 20

Do you know why it happens? I can't get my head around it.

thank you very much for your help!

Ludo

privefl commented 2 years ago

Did you forget to square the singular values?

ldutoit commented 2 years ago

Thanks!

No I did not. From the scree plot: The first few axes along sum up to over 1, and this is confirmed by the singular values:

data <- read.pcadapt("myfile.vcf", type = "vcf")
x <- pcadapt(input = data, K = 20)

##calculate EV
EV <- (x$singular.values^2)

EV2 <-EV*100

EV2
##EV2labels## "PC1, 39.2%", "PC2, 25.7%", "PC3, 21.5%", "PC4, 20.2%"

Very much appreciate the help (and the package).

Ludo

privefl commented 2 years ago

Weird indeed. What is the size of your data? Are you allowed to share it?

ldutoit commented 2 years ago

Hi Florian,

it is only 52k SNPs. I sent it by email to **.21@gmail.com. Hopefully it is the right address!

Ludo

privefl commented 2 years ago

library(pcadapt)
data <- read.pcadapt("tmp-data/out.recode.vcf", type = "vcf")
x <- pcadapt(input = data, K = 20)

X <- bed2matrix(data)
table(X[, x$pass], exclude = NULL)
#       0       1       2    <NA> 
# 1789596   35334  327475  140215 

X.scaled <- scale(X, center = 2 * x$af, scale = sqrt(2 * x$af * (1 - x$af)))
colMeans(X.scaled[, x$pass]^2, na.rm = TRUE)  # should be all ~1, but are ~2

It seems that your genotypes do not follow HWE at all (too few 1s). So that the total variance is approx twice of what the package approximates.

ldutoit commented 2 years ago

That makes sense.

Is the HWE assumption for the imputation or within the pca?

Would it be correct to take the total estimated variance over the first say 100 axes as the "total explained variance" and then look at each of those axes at a proportion of that variance, basically scaling it?

privefl commented 2 years ago

For the scaling used in the PCA (sqrt(2p(1-p))), it is supposed to give you variables that have variance 1 under HWE.

No, I don't think it would be correct. But usually, the singular values (or square of them) decrease almost linearly so that you can extrapolate the rest of them and therefore the total variance. Also, the total variance is just the sum of squared elements of the (scaled) matrix. So, basically, I think you could just use colMeans(X.scaled[, x$pass]^2, na.rm = TRUE) * nrow(X) as the total variance.

privefl commented 2 years ago

Ok, I think that should work. You can extrapolate the variance explained like this (using K = 100 as input):

y <- cumsum(x$singular.values^2)
plot(y, log = "")
y2 <- splinefun(seq_along(y), y, method = "monoH.FC")(101:(nrow(x$scores) - 1))
plot(c(y, y2))
EV <- x$singular.values^2 / max(y2)

ldutoit commented 2 years ago

Thank you so much, very grateful for your help.

That is awesome!

Interestingly, simply scaling using the total variance give relatively similar results.

See below for the first five axes:

 (x$singular.values**2/sum(x$singular.values**2))[1:5]
#0.19307577 0.12665384 0.10608894 0.09965733 0.07435843

EV[1:5]
#0.18669681 0.12246936 0.10258390 0.09636479 0.07190173

Thank you.

Ludo

privefl commented 7 months ago

There is now a better estimate of the total variance used in v4.4.