Reduce memory consumption for PCA when float32 matrix is specified.

facebookarchive / fbpca

Fast Randomized PCA/SVD

Other

490 stars 94 forks source link

Reduce memory consumption for PCA when float32 matrix is specified. #5

Closed vortexdev closed 6 years ago

vortexdev commented 6 years ago

Some numpy functions like "numpy.random.uniform" and "numpy.ones" use float64 by default. If float32 matrix is specified for PCA, then numpy operations between float32 and float64 matrices create additional matrices which dramatically increase RAM usage.

Float32 allows to fit into memory much bigger matrices, and float32 precision is enough for many use-cases.

tygert commented 6 years ago

Might you want to standardize on the randomized PCA in scikit-learn, documented at http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html with svd_solver='randomized'? The modifications proposed in your pull request lead to funny behavior when the entries of the input matrices (A) are integers, with dtype='int64', for example. The implementation in scikit-learn should match the performance of fbpca in most cases, as detailed at https://github.com/scikit-learn/scikit-learn/pull/5141

vortexdev commented 6 years ago

Hi Mark,

We don't have issues with performance, we are just trying to reduce memory usage (to use cheaper instances).

Yes, we tried sklearn PCA before fbpca, but looks like it also has issues with memory consumption on transformation step (even with copy=False and fit_transform). Sklearn (scikit-learn==0.19.1) implementation uses ~2*size(dataset) memory, while fbpca uses ~size(dataset). Perhaps this is a known issue or the expected behaviour? (maybe you can advise something)

But, you are right about the use-case with int64, I have to fix that... Thanks.

Thank you.

tygert commented 6 years ago

You make a very interesting point! The memory usage could be a legitimate issue to raise with the scikit-learn team? I worry a bit about a proliferation of implementations, and was hoping that sklearn could be the standard (but perhaps that is a forlorn hope?).

vortexdev commented 6 years ago

We have tight deadlines for our project, so we decided to use fbpca with our fix...

But I think it makes sense to raise an issue for scikit-learn project. I will perform more tests and will back to this topic to create a github issue for scikit-learn team.

vortexdev commented 6 years ago

Hi Mark,

I've updated PR. What do you think about this solution?

FYI: I've created the issue for scikit-learn project: https://github.com/scikit-learn/scikit-learn/issues/11102.

tygert commented 6 years ago

Thanks for your improvements! We incorporated the gist of your propositions in the new version.