claczny / VizBin

Repository of our application for human-augmented binning

Getting OutOfMemory error despite bigger heap size #4

Closed claczny closed 9 years ago

claczny commented 9 years ago

On a dataset comprising around 70k sequences, I got the following error despite using `java -jar -Xmx3g VizBin-dist.jar`:

java.lang.OutOfMemoryError: Java heap space
    at org.ejml.data.DenseMatrix64F.<init>(Unknown Source)
    at org.ejml.alg.dense.decomposition.svd.SvdImplicitQrDecompose.getW(Unknown Source)
    at org.ejml.alg.dense.decomposition.svd.SvdImplicitQrDecompose.getW(Unknown Source)
    at lcsb.vizbin.service.utils.PrincipleComponentAnalysis.computeBasis(PrincipleComponentAnalysis.java:137)
    at lcsb.vizbin.service.utils.DataSetUtils.computePca(DataSetUtils.java:260)
    at lu.uni.lcsb.vizbin.ProcessInput$2.run(ProcessInput.java:153)

Suggested solution: implement a different way of computing the initial PCA-based dimension reduction, perhaps using something along the lines of http://math.nist.gov/javanumerics/jama/ and http://sujitpal.blogspot.com/2008/10/ir-math-in-java-cluster-visualization.html

[EDIT] See also http://stackoverflow.com/questions/529457/performance-of-java-matrix-math-libraries for other alternatives, in particular JBLAS.

claczny commented 9 years ago

Use the following MATLAB code as a "prototype":

    function [SCORES, PC, VAR] = pca_through_svd(data)
    % pca_through_svd: Perform PCA using SVD.
    % Motivated by http://arxiv.org/abs/1404.1100
    % data   - MxN matrix of input data
    %          (M trials, N dimensions)
    % SCORES - MxN matrix of projected data
    % PC     - each column is a principal component
    % VAR    - Nx1 vector of variances
    [M, N] = size(data);
    % subtract off the mean of each dimension
    mn = mean(data, 1);
    data = data - repmat(mn, M, 1);
    % construct the matrix Y such that Y' * Y = cov(data)
    % (note: normalize by the number of trials M, not N)
    Y = data / sqrt(M - 1);
    % SVD does it all
    [U, S, V] = svd(Y);
    % calculate the variances
    S = diag(S);
    VAR = S .* S;
    % project the original data onto the principal components
    SCORES = data * V;
    PC = V;

and then use only the desired number of dimensions from SCORES, e.g., 30: SCORES(:, 1:30) (MATLAB indexing is 1-based).
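
For illustration, a minimal usage sketch of the prototype (the input matrix X here is hypothetical, e.g. a 70k-by-512 matrix of 5-mer frequencies):

    % run the prototype and keep the 30 leading components
    [SCORES, PC, VAR] = pca_through_svd(X);
    X_red = SCORES(:, 1:30);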

claczny commented 9 years ago

Since we are only interested in the principal components (the eigenvectors of covariance(X)), but not in the eigenvalues, it should be enough to compute V. In fact, it might even be possible to compute only V_red = V(:, 1:30). The dimension-reduced data would then simply be the (centered) data multiplied by V_red.

[EDIT] Even MATLAB has huge difficulties (meaning it uses enormous amounts of memory, even going into swap) when trying to compute the SVD of a 140k-by-512 matrix (which requires around 500MB for the matrix itself alone). The 70k entries mentioned above are maybe still possible, but the bigger we can go, the better. Accordingly, using svds seems to be the way to go. Specifically, svds(X, 30) worked well and fast. This strongly suggests that we should use a library that is able to compute only a certain set of columns of V instead of the full V.
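
To make the reduced computation concrete, a minimal MATLAB sketch (X and the choice of 30 components are taken from the discussion above):

    % centre the data, then compute only the 30 leading singular
    % triplets via svds instead of the full SVD
    mn = mean(X, 1);
    Xc = X - repmat(mn, size(X, 1), 1);
    [~, ~, V_red] = svds(Xc, 30);  % V_red is N-by-30
    SCORES = Xc * V_red;           % M-by-30 projected data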

[EDIT2] It seems that the problem (at least in MATLAB) comes from the fact that the matrix U in the SVD becomes huge when one does not account for this in some way. In fact, for X = U * S * V', where X is an n-by-m matrix, U will be an n-by-n matrix. Assuming n = 140k (or even 70k), U will be huge. The matrix V, on the other hand, is an m-by-m matrix, which is not a problem in our case (for 5-mers, m is 512).
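
One MATLAB-side mitigation is the economy-size decomposition, which avoids forming the full n-by-n U:

    % economy-size SVD: U is n-by-m instead of n-by-n,
    % S and V are m-by-m (m = 512 for 5-mers)
    [U, S, V] = svd(Y, 'econ');

This still computes an n-by-m U, though, so svds remains preferable when only a few components are needed.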

claczny commented 9 years ago

Looking at http://ojalgo.org/performance_ejml.html -> "Singular Value and Eigenvalue (symmetric matrices) Decompositions" shows that pretty much all of the libraries tested there seem to be equally "slow" for large matrices. Thus, it might be interesting to look into "native code" ("If you download the full/raw benchmark results mentioned above they include results for libraries using native code. When everything is to native code's advantage it can easily be 10x faster").

claczny commented 9 years ago

We integrated an improvement by integrating MTJ and computing only partial singular vectors (partial in the sense that we limit the computation to the num_components most "informative" components). This now allows running the PCA (via SVD) on up to 93k sequences (simLC+ @ 950nt; just for testing purposes, no other scientific meaning in this value choice) with a heap size of 3GB, see also #8 and #9. Further improvements might be feasible, but this issue will be closed for now.