andland / generalizedPCA

Dimensionality reduction for exponential family data by extending PCA

memory usage problems #13

Closed. evilrobotxoxo closed this issue 7 years ago.

evilrobotxoxo commented 7 years ago

hello,

I've just started using this package to analyze neurophysiological recordings, and it works great. My only issue is that I have a lot of time points, and if I run generalized PCA on an N x M matrix (with N >> M), somewhere in the code it attempts to allocate an N x N double-precision matrix. I assume this because I get errors saying, for example, that the code tried to allocate a 251.3 GB matrix, which is exactly the size an N x N double matrix would be. This is a real problem, because it means I can't analyze a full dataset. I looked through the code a little but was unable to find where this happens. Could this be mitigated in some way, for example by using a sparse matrix, or int/float instead of double? My data is count data, and it's very sparse.
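For concreteness, a quick back-of-the-envelope check (the N below is back-computed from the error message, not measured from my data):

```r
# A dense N x N matrix of doubles needs 8 * N^2 bytes.
# Back-computing N from the 251.3 GB allocation in the error message:
bytes <- 251.3 * 1e9
N <- sqrt(bytes / 8)
N
# ~177,000, roughly the number of time points, so it's an N x N
# intermediate (not the N x M data itself) that blows up
```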

thanks!

evilrobotxoxo commented 7 years ago

Due to a bug in my test code, I thought this was a more complicated problem than it really is. The problem is that there are three instances where diag(W) %*% is used, and those just need to be replaced with elementwise multiplication, the R equivalent of MATLAB's repmat(W, 1, d).
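For anyone else hitting this, a minimal sketch of the identity (A and W here are stand-ins, not the package's actual variable names):

```r
set.seed(1)
N <- 5; d <- 3
A <- matrix(rnorm(N * d), N, d)
W <- runif(N)

slow   <- diag(W) %*% A        # materializes an N x N matrix
repmat <- matrix(W, N, d) * A  # MATLAB-style repmat(W, 1, d)
fast   <- W * A                # column-major recycling: row i scaled by W[i]

all.equal(slow, repmat)  # TRUE
all.equal(slow, fast)    # TRUE
```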

This should be computationally faster as well. Problem solved!

andland commented 7 years ago

Thank you for your interest in the package. I looked into your suggestion, and you are right. When the number of rows is much greater than the number of columns, using diag(W) is inefficient. When the numbers of rows and columns are comparable, however, there is not much of a gain. As you indicate, the code can be updated by changing t(eta_centered) %*% diag(W) to t(eta_centered * W). I will make this change in the near future.
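For reference, a quick check of that rewrite on toy inputs (eta_centered and W below are stand-ins for the package's internal objects):

```r
set.seed(2)
N <- 6; d <- 2
eta_centered <- matrix(rnorm(N * d), N, d)
W <- runif(N)

old <- t(eta_centered) %*% diag(W)  # forms the N x N diag(W) first
new <- t(eta_centered * W)          # scales rows, then transposes; no N x N object

all.equal(old, new)  # TRUE
```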

When the data is sparse, I also have an idea for avoiding the creation of dense matrices in each iteration, by solving the eigenproblem using only inner products. This would greatly reduce the memory requirement, though the flop count would be about the same. That is a bigger undertaking, however. For now, this code is more of a proof of concept, meant for small and medium-sized data.
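Not the package's implementation, but a rough sketch of that idea under one reading: if the relevant step reduces to an eigendecomposition of a d x d weighted cross-product, the sparse data matrix never has to be densified. Everything here (X, W, the use of the Matrix package) is illustrative:

```r
library(Matrix)

set.seed(3)
N <- 1e5; d <- 50
X <- rsparsematrix(N, d, density = 0.01)  # sparse count-like data (illustrative)
W <- runif(N)                             # per-row weights (illustrative)

# t(X) %*% diag(W) %*% X via inner products, keeping everything sparse:
WX <- Diagonal(x = W) %*% X         # sparse diagonal stores N values, not N^2
G  <- as.matrix(crossprod(X, WX))   # d x d result, small enough to densify

e <- eigen(G, symmetric = TRUE)
V <- e$vectors[, 1:2]               # leading directions (d x k)
```

As noted, the flops are comparable, but peak memory drops to roughly O(nnz(X) + d^2).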

evilrobotxoxo commented 7 years ago

Thanks for your reply. With the minor diag(W) tweak, this code worked well with some pretty big datasets I had, without requiring sparse matrices. I think this is very cool work, and thanks for putting it out in a user-friendly package.