KrishnaswamyLab / graphtools

Tools for building and manipulating graphs in Python
GNU General Public License v3.0

Inaccuracy in randomized SVD at moderate ranks #58

Open stanleyjs opened 3 years ago

stanleyjs commented 3 years ago

Hi,

Randomized SVD is not accurate

Currently most (if not all) of the PCA / linear dimensionality reduction / SVD in graphtools is routed through either TruncatedSVD or PCA(svd_solver='randomized'). It turns out that this solver can be quite inaccurate even at moderate ranks. Consider this pathological example, in which we create a 1000 x 500 matrix with np.hstack([np.zeros(249,),np.arange(250,501)]) as its spectrum, i.e. roughly rank 250 (strictly, np.arange(250,501) yields 251 nonzero singular values). We will also consider its rank-50 reconstruction and its rank-1 approximation.

import numpy as np
from sklearn.decomposition import TruncatedSVD

L2_sv_error = np.zeros(10,)
F_lowrank_error = np.zeros(10,)
k = 50
r = 250
lowrank_r = 50
m = 1000
n = 500
for ix in range(10):
    x = np.random.randn(m,n)
    ux, _ , vhx = np.linalg.svd(x,full_matrices=False) # ux is 1000 x 500 with orthonormal columns, vhx is 500 x 500 orthogonal
    sx = np.hstack([np.zeros(r-1,),np.arange(r,n+1)])[::-1] # the ground truth singular values we will use
    sx_lowrank = np.hstack([np.zeros(n-lowrank_r-1,),np.arange(n-lowrank_r,n+1)])[::-1] # the spectrum of the best rank-50 approximation
    y = (ux*sx)@vhx # the ground truth matrix we want to approximate with SVD/PCA
    y_lowrank = (ux*sx_lowrank) @ vhx # the best rank-50 approximation if we had the exact spectrum
    pca_operator = TruncatedSVD(k) 
    pca_operator.fit(y)
    sy = pca_operator.singular_values_ # our estimate of the singular values
    y_transformed = pca_operator.transform(y)
    y_reconstructed = y_transformed@pca_operator.components_ # the previous two operations are equivalent to pca_operator.inverse_transform(pca_operator.transform(y)), this is our estimated rank 50 approximation
    L2_sv_error[ix] = np.linalg.norm(sx[:lowrank_r]-sy) # how close did we get to estimating the top 50 singular values in L2?
    F_lowrank_error[ix] = np.linalg.norm(y_lowrank-y_reconstructed) # how close is our rank 50 approximation?
>>> print(np.mean(L2_sv_error))
55.77929710543534
>>> print(np.mean(F_lowrank_error))
1930.877814905553

It is clear that there is a problem.

It turns out that if we increase k, our estimate gets better:

L2_sv_error = np.zeros(10,)
F_lowrank_error = np.zeros(10,)
k = 100
r = 250
lowrank_r = 50
m = 1000
n = 500
for ix in range(10):
    x = np.random.randn(m,n)
    ux, _ , vhx = np.linalg.svd(x,full_matrices=False) # ux is 1000 x 500 with orthonormal columns, vhx is 500 x 500 orthogonal
    sx = np.hstack([np.zeros(r-1,),np.arange(r,n+1)])[::-1] # the ground truth singular values we will use
    sx_lowrank = np.hstack([np.zeros(n-lowrank_r-1,),np.arange(n-lowrank_r,n+1)])[::-1] # the spectrum of the best rank-50 approximation
    y = (ux*sx)@vhx # the ground truth matrix we want to approximate with SVD/PCA
    y_lowrank = (ux*sx_lowrank) @ vhx # the best rank-50 approximation if we had the exact spectrum
    pca_operator = TruncatedSVD(k) 
    pca_operator.fit(y)
    sy = pca_operator.singular_values_ # our estimate of the singular values
    y_transformed = pca_operator.transform(y)
    y_reconstructed = y_transformed[:,:lowrank_r]@pca_operator.components_[:lowrank_r,:] # truncate the rank-100 fit to its top 50 components; this is our estimated rank-50 approximation
    L2_sv_error[ix] = np.linalg.norm(sx[:lowrank_r]-sy[:lowrank_r]) # how close did we get to estimating the top 50 singular values in L2?
    F_lowrank_error[ix] = np.linalg.norm(y_lowrank-y_reconstructed) # how close is our rank 50 approximation?
>>> print(np.mean(L2_sv_error))
7.755560034045646
>>> print(np.mean(F_lowrank_error))
729.3025096061267

We can also decrease the rank of the underlying approximation to get better accuracy. What is happening here is that randomized_svd gets more accurate as more singular vectors are requested relative to the rank of the matrix: as n_components gets closer to (and exceeds) the rank of the matrix, the algorithm gets more accurate. Finally, let's look at the extreme case and compare rank-1 approximations. The task here is to estimate only a single singular pair.

L2_sv_error = np.zeros(10,)
F_lowrank_error = np.zeros(10,)
k = 1
r = 250
lowrank_r = 1
m = 1000
n = 500
for ix in range(10):
    x = np.random.randn(m,n)
    ux, _ , vhx = np.linalg.svd(x,full_matrices=False) # ux is 1000 x 500 with orthonormal columns, vhx is 500 x 500 orthogonal
    sx = np.hstack([np.zeros(r-1,),np.arange(r,n+1)])[::-1] # the ground truth singular values we will use
    sx_lowrank = np.hstack([np.zeros(n-lowrank_r-1,),np.arange(n-lowrank_r,n+1)])[::-1] # the spectrum of the best rank-1 approximation
    y = (ux*sx)@vhx # the ground truth matrix we want to approximate with SVD/PCA
    y_lowrank = (ux*sx_lowrank) @ vhx # the best rank-50 approximation if we had the exact spectrum
    pca_operator = TruncatedSVD(k) 
    pca_operator.fit(y)
    sy = pca_operator.singular_values_ # our estimate of the singular values
    y_transformed = pca_operator.transform(y)
    y_reconstructed = y_transformed[:,:lowrank_r]@pca_operator.components_[:lowrank_r,:] # keep only the top singular pair; this is our estimated rank-1 approximation
    L2_sv_error[ix] = np.linalg.norm(sx[:lowrank_r]-sy[:lowrank_r]) # how close did we get to estimating the top singular value?
    F_lowrank_error[ix] = np.linalg.norm(y_lowrank-y_reconstructed) # how close is our rank-1 approximation?
>>> print(np.mean(L2_sv_error))
10.74597935589037
>>> print(np.mean(F_lowrank_error))
772.1048200820816

We can make the algorithm more accurate

It turns out that there are a lot of edge cases in which randomized SVD will fail: the matrix is too large, ill-conditioned, the requested rank is too high, etc. However, there are a few parameters of the inner function, sklearn.utils.extmath.randomized_svd, that can be tweaked to make things more accurate. The biggest one is n_oversamples, followed by n_iter.
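For reference, here is a minimal sketch of calling the inner routine directly with those knobs turned up; the parameter values are illustrative, not tuned recommendations:

from sklearn.utils.extmath import randomized_svd

# y is the 1000 x 500 test matrix from the examples above
u, s, vt = randomized_svd(
    y,
    n_components=50,
    n_oversamples=200,  # default is 10: pads the random sketch beyond n_components
    n_iter=10,          # default is 'auto': extra power iterations sharpen the estimated subspace
    random_state=None,
)
y_reconstructed = (u * s) @ vt  # rank-50 approximation built from the oversampled sketch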

How to change graphtools

I propose that we replace all calls to PCA and TruncatedSVD with explicit calls to randomized_svd, and set a sensible n_oversamples as a factor of the requested n_pca; the default is not very good. The sklearn documentation suggests that for a rank-k matrix, n_oversamples should be 2*k - n_components, or simply n_components when n_components >= k, but I have found that for hard problems this is not enough. We can also add an svd_kwargs keyword argument to the graph constructors to allow passing kwargs through to randomized SVD, to increase accuracy or to trade accuracy for performance.
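A minimal sketch of the proposed pass-through; the helper name and defaults here are hypothetical, just to illustrate the shape of the API:

from sklearn.utils.extmath import randomized_svd

def _reduce(data, n_pca, svd_kwargs=None):
    # Hypothetical helper: rank-n_pca reduction via randomized_svd.
    # Oversample by n_pca by default (i.e. sketch 2 * n_pca columns),
    # which is far more conservative than sklearn's default of 10.
    svd_kwargs = {} if svd_kwargs is None else dict(svd_kwargs)
    svd_kwargs.setdefault('n_oversamples', n_pca)
    u, s, _ = randomized_svd(data, n_components=n_pca, **svd_kwargs)
    return u * s  # transformed coordinates, as in TruncatedSVD.transform

Graph constructors would then accept, e.g., svd_kwargs=dict(n_oversamples=500, n_iter=10) and forward it here.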

@scottgigante @dburkhardt

dburkhardt commented 3 years ago

Thanks for looking into this @stanleyjs. This is a hectic week for me, but I'll make a note to come back to this. Fundamentally I have no issue with directly calling randomized_svd and setting parameters to provide better estimates, so long as it doesn't lead to using dramatically more memory or compute time. Have you looked into the resource usage with higher n_oversamples or n_iter?

stanleyjs commented 3 years ago

@dburkhardt Here is a plot of the errors: [image: https://user-images.githubusercontent.com/16860172/142485537-583b4b42-6b5b-4814-b214-bb7517a6b142.png] And you can see the notebook that created it here: https://gist.github.com/stanleyjs/cb223cedb913942c4f9349b53f800ced

Clearly it's an issue. However, I was thinking of maybe just submitting a PR upstream in sklearn to add n_oversamples to TruncatedSVD
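(For illustration, with that PR the examples above would reduce to something like the snippet below; this assumes a scikit-learn release that actually ships the new parameter:)

from sklearn.decomposition import TruncatedSVD

# Hypothetical until the upstream PR lands: n_oversamples passed straight through
pca_operator = TruncatedSVD(n_components=50, n_oversamples=200)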

scottgigante commented 3 years ago

TBH I think sklearn might be the right place to fix this. If the PR is rejected then we should add it here, but no reason why more people shouldn't benefit from this fix.

stanleyjs commented 3 years ago

See https://github.com/scikit-learn/scikit-learn/pull/21705

stanleyjs commented 2 years ago

@scottgigante @dburkhardt it appears that sklearn will probably fix this gap, but not without some internal discussion over the details of the API.

I am wondering if we should go ahead and patch in the randomized SVD kwargs. Also, I notice that we'd only have to patch in a workaround for sparse matrices / TruncatedSVD; it looks like PCA (the dense-matrix class) already has the n_oversamples argument.
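(For the dense path that would just mean forwarding the argument, e.g. as below, assuming a scikit-learn version that exposes n_oversamples on PCA:)

from sklearn.decomposition import PCA

# Dense path: recent scikit-learn exposes the oversampling knob on PCA directly,
# so only the sparse / TruncatedSVD path needs a workaround.
pca_operator = PCA(n_components=50, svd_solver='randomized', n_oversamples=200, random_state=0)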

scottgigante commented 2 years ago

Fine by me if you want to write the patch. Probably the easiest approach is to monkey-patch, with a maximum version pin on sklearn (set to the current version + 1).
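A minimal sketch of that monkey-patch, assuming TruncatedSVD resolves randomized_svd via the private module sklearn.decomposition._truncated_svd (a version-dependent internal, hence the version cap):

import sklearn
import sklearn.decomposition._truncated_svd as _tsvd
from sklearn.utils.extmath import randomized_svd as _randomized_svd

def _patched_randomized_svd(M, n_components, **kwargs):
    # TruncatedSVD already forwards n_iter and random_state as kwargs;
    # we only inject a larger (hypothetical) oversampling default.
    kwargs.setdefault('n_oversamples', n_components)
    return _randomized_svd(M, n_components, **kwargs)

# Only patch versions expected to lack the fix; past that, both the private
# module path and the need for the patch are uncertain.
if tuple(int(v) for v in sklearn.__version__.split('.')[:2]) < (1, 1):
    _tsvd.randomized_svd = _patched_randomized_svd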

dburkhardt commented 2 years ago

I agree with Scott here. Thanks for doing this @stanleyjs! Let me know if you need any help.