Might you want to standardize on the randomized PCA in scikit-learn, documented at http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html with svd_solver='randomized'? The modifications proposed in your pull request lead to funny behavior when the entries of the input matrices (A) are integers, with dtype='int64', for example. The implementation in scikit-learn should match the performance of fbpca in most cases, as detailed at https://github.com/scikit-learn/scikit-learn/pull/5141
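(For reference, a minimal sketch of the suggestion above; the data shape and `n_components` are placeholder values, not part of the original discussion.)

```python
import numpy as np
from sklearn.decomposition import PCA

# Placeholder data standing in for the input matrix A discussed above.
A = np.random.rand(10000, 300)

# svd_solver='randomized' selects the randomized algorithm documented at the
# scikit-learn link above.
pca = PCA(n_components=50, svd_solver='randomized', random_state=0)
scores = pca.fit_transform(A)
```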
Hi Mark,
We don't have issues with performance; we are just trying to reduce memory usage (so we can use cheaper instances).
Yes, we tried sklearn PCA before fbpca, but it looks like it also has memory-consumption issues at the transformation step (even with copy=False and fit_transform). The sklearn (scikit-learn==0.19.1) implementation uses roughly 2*size(dataset) of memory, while fbpca uses roughly size(dataset). Perhaps this is a known issue or the expected behaviour? (Maybe you can advise something.)
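A minimal sketch of one way to compare the peak memory of the two code paths (tracemalloc, the dataset shape, and k=50 are illustrative assumptions, not the benchmark we actually ran; tracemalloc only sees NumPy buffer allocations on reasonably recent NumPy versions):

```python
import numpy as np
import tracemalloc
import fbpca
from sklearn.decomposition import PCA

X = np.random.rand(20000, 1000).astype(np.float32)  # placeholder dataset

def peak_mib(fn):
    # Peak memory (in MiB) allocated while fn() runs.
    tracemalloc.start()
    fn()
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak / 2**20

print('sklearn PCA:', peak_mib(
    lambda: PCA(n_components=50, svd_solver='randomized', copy=False).fit_transform(X)))
print('fbpca:      ', peak_mib(lambda: fbpca.pca(X, k=50)))
```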
But you are right about the int64 use case; I have to fix that... Thanks.
Thank you.
You make a very interesting point! The memory usage could be a legitimate issue to raise with the scikit-learn team? I worry a bit about a proliferation of implementations, and was hoping that sklearn could be the standard (but perhaps that is a forlorn hope?).
We have tight deadlines for our project, so we decided to use fbpca with our fix...
But I think it makes sense to raise an issue with the scikit-learn project. I will perform more tests and come back to this topic to create a GitHub issue for the scikit-learn team.
Hi Mark,
I've updated the PR. What do you think about this solution?
FYI: I've created an issue for the scikit-learn project: https://github.com/scikit-learn/scikit-learn/issues/11102.
Thanks for your improvements! We incorporated the gist of your proposals in the new version.
Some numpy functions, like numpy.random.uniform and numpy.ones, return float64 by default. If a float32 matrix is passed to the PCA, numpy operations mixing float32 and float64 matrices create additional upcast copies, which dramatically increases RAM usage.
Float32 allows much bigger matrices to fit into memory, and float32 precision is enough for many use cases.
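A small sketch of the upcasting effect described above (the array size is just an illustration):

```python
import numpy as np

X = np.random.rand(10000, 1000).astype(np.float32)   # ~40 MB in float32

# numpy.ones defaults to float64, so the product is upcast to float64:
# a temporary ~80 MB float64 copy of the data is created.
upcast = X * np.ones(X.shape[1])
print(upcast.dtype)   # float64

# Passing an explicit dtype keeps the intermediate in float32 and halves memory.
same = X * np.ones(X.shape[1], dtype=X.dtype)
print(same.dtype)     # float32
```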