JEFworks / MUDAN

Multi-sample Unified Discriminant ANalysis
http://jef.works/MUDAN/
GNU General Public License v3.0
72 stars 12 forks source link

PC calculation #1

Open ahy1221 opened 6 years ago

ahy1221 commented 6 years ago

Hi, I was wondering that why the PC calculation for the expression matrix in the MUDAN is a <- irlba::irlba(crossprod(m)/(nrow(m)-1), nu=0, nv=nPcs, tol=tol, ...) a$l <- m %*% a$v m <- t(a$l) If I understand right, in the Pagoda2 or the Seurat setting, the PC calculation is just a SVD decomposition on the expression matrix using irlba package to get the U matrix. What is the difference between MUDAN's PCs and others and is there any idea in that ? Thanks for let me know.

Yao

JEFworks commented 6 years ago

Hi Yao,

As you can tell from the lack of documentation, MUDAN is still in development (and a long ways away from publication). So I'm impressed by your willingness to go through the code.

The PC calculation is nothing new/innovative. As you can see from this very nice StackOverflow post (https://stats.stackexchange.com/questions/134282/relationship-between-svd-and-pca-how-to-use-svd-to-perform-pca), X or m is the centered input, for which we compute the covariance matrix C = X'X/(n−1) or crossprod(m)/(nrow(m)-1), and we can then use irlba to perform SVD to get eigenvectors a$v. The PCs are then XV or m %*% a$v.

You can check that MUDAN's fastPCA and regular prcomp gives approximately the same results with substantially improved runtime:

library(MUDAN)                                                                                                                                                                                                             

set.seed(0)                                                                                                                                                                                                                
mat <- matrix(rnorm(1000*1000),1000,1000)                                                                                                                                                                                  
rownames(mat) <- paste0('row', seq_len(nrow(mat)))                                                                                                                                                                         
colnames(mat) <- paste0('col', seq_len(ncol(mat)))                                                                                                                                                                         

## first try fastPca                                                                                                                                                                                                       
start_time <- Sys.time()                                                                                                                                                                                                   
nPcs <- 2                                                                                                                                                                                                                  
pcs <- MUDAN::fastPca(t(mat), nPcs)                                                                                                                                                                                        
m <- pcs$l                                                                                                                                                                                                                 
colnames(m) <- paste0('PC', seq_len(nPcs))                                                                                                                                                                                 
test1 <- m                                                                                                                                                                                                                 
end_time <- Sys.time()                                                                                                                                                                                                     
end_time - start_time                                                                                                                                                                                                      

## now try normal prcomp                                                                                                                                                                                                   
start_time <- Sys.time()                                                                                                                                                                                                   
pcs <- prcomp(t(mat))                                                                                                                                                                                                      
test2 <- pcs$x[, seq_len(nPcs)]                                                                                                                                                                                            
end_time <- Sys.time()                                                                                                                                                                                                     
end_time - start_time                                                                                                                                                                                                      

## compare                                                                                                                                                                                                                 
par(mfrow=c(1,2))                                                                                                                                                                                                          
plot(test1[,1], test2[,1])                                                                                                                                                                                                 
plot(test1[,2], test2[,2])    

Time difference of 0.3974726 secs for fastPca Time difference of 2.288288 secs for prcomp

screen shot 2017-11-05 at 3 14 53 pm

Best, Jean

ahy1221 commented 6 years ago

Sorry for the late reply, thank you very much ! It is helpful and learn a lot ! :)

julirsch commented 6 years ago

Hi Jean! I was researching a somewhat related PCA question and stumbled upon this. Thanks for the awesome stack exchange link.

My understanding is that for a centered sample by gene matrix (X) you can do the following to identify the V matrix from which you can derive the sample by PC matrix (XV):

1) You can compute and diagonalize the covariance matrix (X(t)X/(n-1)) to identify V. 2) You can compute SVD (or partial SVD) on X to identify V (or a partial V).

In the code Yao highlighted from your package (which looks really cool!), it looks like you are performing SVD on the covariance matrix and not X itself, so your resulting PC matrix (XV) will be the sample by PC matrix for the covariance and not for X itself (although I'm not 100% sure this is right, since it looks like you multiply the original X by the V for the X(t)X/(n-1). Perhaps this is your intention, but if I'm reading your code right, in your comparison it looks like you're comparing the PC matrix for the covariance calculated by MUDAN::fastPCA to the PC matrix for X itself calculated by prcomp. (I had also always thought that one reason people estimated partial SVD was to avoid calculating the full covariance matrix for big X matrices).

Sorry to be a bother, but I just couldn't square all of this with myself and thought I would post!

Best, Jacob

JEFworks commented 6 years ago

Hi Jacob!

Yes, your understanding is correct.

it looks like you are performing SVD on the covariance matrix

Correct. Please look through the equations of the Stack Overflow post again to better understand why SVD on either X and C will be able to find V. The good thing with math is that you can always sanity check by computing an example. You can show that the V derived from either SVDs (when X is centered) are equivalent and will ultimately produce the same PCs.

set.seed(0) 
mat <- matrix(rnorm(1000*500),1000,500) 
rownames(mat) <- paste0('row', seq_len(nrow(mat))) 
colnames(mat) <- paste0('col', seq_len(ncol(mat))) 
## center
m <- t(mat)
m <- scale(m, scale=FALSE, center=TRUE)
nPcs <- 1

## svd on X = USV'
foo <- irlba::irlba(m)
foo.v <- foo$v[,1:nPcs]
## svd on C = X'X/(n-1) = VLV'
bar <- irlba::irlba(crossprod(m)/(nrow(m)-1), nu=0, nv=nPcs)
bar.v <- bar$v[,1:nPcs]
## built in
pcs.info <- prcomp(m, scale=FALSE, center=FALSE)
pcs.v <- pcs.info$rotation[,1:nPcs] ## variable loadings from prcomp

## compare v
all.equal(foo.v, bar.v)
all.equal(foo.v, pcs.v)

## multiple by X to get PCs
pcs.foo <- m %*% foo.v
pcs.bar <- m %*% bar.v
pcs <- pcs.info$x[, 1:nPcs] ## data multiplied by the rotation matrix from prcomp

I had also always thought that one reason people estimated partial SVD was to avoid calculating the full covariance matrix for big X matrices

Hum, I think you are right about this. I originally read that for m>n (more genes than cells for example, which is what we typically deal with) that decomposition of the covariance function was more efficient. But I just benchmarked that SVD on X is indeed faster than on the covariance matrix with the built-in PBMC data.

If you'd be interested in making a fork, correcting the fastPCA method to using X instead of the covariance, and making a pull request, I'd be happy to merge so you can get credit for your contribution. I'd also appreciate if you could run this following test and include the results in your commit message to ensure that results don't change and is indeed faster:

data(pbmcA) ## real data
cd <- as.matrix(pbmcA)
mat <- normalizeCounts(cd)
matnorm.info <- normalizeVariance(mat, details=TRUE)
matnorm <- log10(matnorm.info$mat[matnorm.info$ods,]+1)

start_time <- Sys.time() 
pcs <- prcomp(t(matnorm)) ## built in
end_time <- Sys.time()  
end_time - start_time 
start_time <- Sys.time() 
pcs1 <- MUDAN::fastPca(t(matnorm), 1) ## old
end_time <- Sys.time()  
end_time - start_time 
start_time <- Sys.time() 
pcs2 <- fastPca(t(matnorm), 1) ## yours
end_time <- Sys.time()  
end_time - start_time 

plot(pcs1$l[,1], pcs2$l[,1])
## or
all.equal(pcs1$l[,1], pcs2$l[,1])
## yada yada you get the idea

Let me know if you're interested.

Thanks Jacob!

Best, Jean

julirsch commented 6 years ago

Definitely!

Jacob