Open stanleyjs opened 3 years ago
Thanks for looking into this @stanleyjs. This is a hectic week for me but I'll make a note to come back to this. Fundamentally I have no issue with directly calling `randomized_svd` and setting parameters to provide better estimates, so long as it doesn't lead to using dramatically more memory or compute time. Have you looked into the resource usage with higher `n_oversamples` or `n_iter`?
@dburkhardt Here is a plot of the errors: https://user-images.githubusercontent.com/16860172/142485537-583b4b42-6b5b-4814-b214-bb7517a6b142.png
And you can see the notebook that created it here: https://gist.github.com/stanleyjs/cb223cedb913942c4f9349b53f800ced
Clearly it's an issue. However, I was thinking of maybe just submitting a PR upstream in sklearn to add `n_oversamples` to `TruncatedSVD`.
TBH I think sklearn might be the right place to fix this. If the PR is rejected then we should add it here, but no reason why more people shouldn't benefit from this fix.
@scottgigante @dburkhardt it appears that sklearn will probably fix this gap, but not without some internal discussion over the details of the API.
I am wondering if we should go ahead and patch in the `randomized_svd` kwargs. Also, I notice that we'd only have to patch in a workaround for sparse matrices / `TruncatedSVD`; it looks like `PCA` (the dense matrix class) has the `n_oversamples` argument.
Fine by me if you want to write the patch. Probably easiest is to monkey-patch with a maximum version on sklearn (set to the current version + 1).
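A minimal sketch of the version-gated patch being discussed, shown here as a subclass shadowing `TruncatedSVD` rather than a literal monkey-patch. The class name, the `< 1.1` version cap, and the oversampling heuristic are all illustrative assumptions, not graphtools API:

```python
# Hedged sketch, not graphtools code: gate the workaround on the sklearn
# version and route fit_transform through randomized_svd directly so that
# n_oversamples can be controlled.
import sklearn
from sklearn.decomposition import TruncatedSVD
from sklearn.utils.extmath import randomized_svd


class OversampledTruncatedSVD(TruncatedSVD):
    """TruncatedSVD that calls randomized_svd directly with extra oversampling."""

    def __init__(self, n_components=2, n_oversamples=None, n_iter=5, random_state=None):
        super().__init__(n_components=n_components, n_iter=n_iter, random_state=random_state)
        # Illustrative heuristic: oversample by the number of requested components.
        self.n_oversamples = n_oversamples if n_oversamples is not None else n_components

    def fit_transform(self, X, y=None):
        U, Sigma, VT = randomized_svd(
            X,
            n_components=self.n_components,
            n_oversamples=self.n_oversamples,
            n_iter=self.n_iter,
            random_state=self.random_state,
        )
        self.components_ = VT
        self.singular_values_ = Sigma
        # Return the embedding U * Sigma, as TruncatedSVD.fit_transform does.
        return U * Sigma


# Only shadow the class below the (hypothetical) version where sklearn fixes this.
if tuple(int(v) for v in sklearn.__version__.split(".")[:2]) < (1, 1):
    PatchedSVD = OversampledTruncatedSVD
else:
    PatchedSVD = TruncatedSVD
```

This keeps the rest of the codebase untouched: callers construct `PatchedSVD` and get the workaround only on sklearn versions that need it.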
I agree with Scott here. Thanks for doing this @stanleyjs! Let me know if you need any help.
Hi,
Randomized SVD is not accurate
Currently most (if not all) of the PCA / linear dimensionality reduction / SVD is first routed through either `TruncatedSVD` or `PCA(svd_solver='randomized')`. It turns out that this solver can be pretty bad at computing even moderate-rank SVDs. Consider this pathological example in which we create a 1000 x 500 matrix with `np.hstack([np.zeros(249,), np.arange(250, 501)])` as its spectrum. The matrix has rank 251 (the 251 nonzero singular values 250 through 500). We will also consider its rank-50 reconstruction and its rank-1 approximation.
It is clear that there is a problem
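The setup can be reproduced roughly as follows. The random orthonormal factors and the error comparison are assumptions on my part; the linked gist may construct the matrix differently:

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(42)
# 249 zero singular values followed by 250..500 (500 values in total).
spectrum = np.hstack([np.zeros(249), np.arange(250, 501)])

# Assemble a 1000 x 500 matrix with this spectrum from random orthonormal factors.
U, _ = np.linalg.qr(rng.standard_normal((1000, 500)))
V, _ = np.linalg.qr(rng.standard_normal((500, 500)))
X = (U * spectrum) @ V.T

exact = np.sort(spectrum)[::-1]  # true singular values, descending

# Relative error on the top 50 singular values: once with the rank-50 request
# from the text, once requesting components up to the rank of the matrix.
err = {}
for k in (50, 251):
    svd = TruncatedSVD(n_components=k, random_state=0).fit(X)
    err[k] = np.max(np.abs(svd.singular_values_[:50] - exact[:50]) / exact[:50])
```

With defaults, the `n_components=50` run has to recover a 251-dimensional range from only `50 + n_oversamples` random probes, which is where the inaccuracy comes from.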
It turns out that we can increase `k` and our estimate gets better. We can also decrease the rank of the underlying approximation to get better accuracy. What is happening here is that `randomized_svd` gets more accurate when more singular vectors are requested, relative to the rank of the matrix. As `n_components` gets closer to (and larger than) the rank of the matrix, the algorithm gets more accurate. Let's finally look at the extreme case and compare our rank-1 approximation. The task here is to estimate only a single singular pair.
We can make the algorithm more accurate
It turns out that there are a lot of edge cases and examples where randomized SVD will fail, either because the matrix is too large or ill-conditioned, the rank is too high, etc. However, there are a few parameters of the inner function, `sklearn.utils.extmath.randomized_svd`, that can be tweaked to make things more accurate. The biggest one is `n_oversamples`, and then `n_iter`.
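As a sketch of the effect of these two knobs (the matrix and the parameter values here are illustrative, not a recommendation):

```python
import numpy as np
from sklearn.utils.extmath import randomized_svd

rng = np.random.default_rng(0)
# A 300 x 200 matrix with a slowly decaying spectrum: the hard regime
# for randomized SVD, since singular values are poorly separated.
s_true = np.linspace(100.0, 50.0, 200)
U, _ = np.linalg.qr(rng.standard_normal((300, 200)))
V, _ = np.linalg.qr(rng.standard_normal((200, 200)))
X = (U * s_true) @ V.T

# sklearn defaults: n_oversamples=10, n_iter="auto".
_, s_default, _ = randomized_svd(X, n_components=10, random_state=0)

# More oversampling and more power iterations give a better estimate, at the
# cost of the extra memory and compute Scott asks about above.
_, s_tuned, _ = randomized_svd(
    X, n_components=10, n_oversamples=190, n_iter=10, random_state=0
)
```

Here `n_components + n_oversamples` covers the full column dimension, so the tuned run recovers the top singular values essentially exactly; intermediate settings trade accuracy against cost.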
How to change graphtools
I propose that we replace all calls to `PCA` and `TruncatedSVD` with explicit calls to `randomized_svd`, and that we set a sensible `n_oversamples` as a factor of the requested `n_pca`; the default is not very good. For a rank-`k` matrix, the sklearn documentation suggests that `n_oversamples` should be `2*k - n_components`, or simply `n_components` when `n_components >= k`, but I have found that for hard problems this is not enough. We can also add an `svd_kwargs` keyword argument to the graph constructors to allow passing kwargs through to randomized SVD, to increase accuracy or trade accuracy for performance.
@scottgigante @dburkhardt