epfl-lts2 / pygsp

Graph Signal Processing in Python
https://pygsp.rtfd.io
BSD 3-Clause "New" or "Revised" License

Compute partial Fourier basis (feature suggestion) #27

Closed scottgigante closed 6 years ago

scottgigante commented 6 years ago

pygsp.graphs.Graph.compute_fourier_basis raises a warning on large matrices.

if self.N > 3000:
    self.logger.warning('Computing the full eigendecomposition of a '
                        'large matrix ({0} x {0}) may take some '
                        'time.'.format(self.N))

Is it worth adding a function that computes the first n eigenvectors via sklearn.utils.extmath.randomized_svd? Alternatively, if you don't want the overhead of adding sklearn as a dependency, that function could be copied into PyGSP.

I'm happy to submit a PR if this is something you're interested in adding.
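Something along these lines, sketched on a random symmetric matrix rather than an actual PyGSP graph (the names here are only for illustration):

from scipy import sparse
from sklearn.utils.extmath import randomized_svd

# stand-in for a Laplacian: a random sparse symmetric matrix
X = sparse.random(1000, 1000, density=0.01, format='csr', random_state=0)
X = X + X.T

# truncated randomized SVD: the 100 components with the largest singular
# values; for a symmetric matrix the singular vectors are eigenvectors up
# to sign
U, S, VT = randomized_svd(X, n_components=100)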

mdeff commented 6 years ago

Thanks for the suggestion and considering a PR. :) What would be your use of a partial eigendecomposition?

It can be efficiently computed with the Implicitly Restarted Lanczos Method by scipy.sparse.linalg.eigsh.
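For instance (just a sketch on a random graph, not PyGSP code):

from scipy import sparse
from scipy.sparse.csgraph import laplacian
from scipy.sparse.linalg import eigsh

# random symmetric adjacency matrix and its combinatorial Laplacian
A = sparse.random(1000, 1000, density=0.01, format='csr', random_state=0)
A = A + A.T
L = laplacian(A)

# 10 smallest eigenpairs; shift-invert (sigma=0) is usually faster than
# which='SM' but requires factorizing L
eigenvalues, eigenvectors = eigsh(L, k=10, which='SM')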

scottgigante commented 6 years ago

Thanks for the quick response! I'm often computing the Fourier basis just to examine the first few eigenvectors. I would also use it for approximate spectral clustering on graphs built on large datasets. If you think it's too niche a use case I won't be offended :)

mdeff commented 6 years ago

Those are common use cases, I guess. How would you propose to implement that? An n_eigenvectors parameter to the compute_fourier_basis function? Then store the partial eigendecomposition as the Fourier basis?

Any advantage of sklearn.utils.extmath.randomized_svd over scipy.sparse.linalg.eigsh?

scottgigante commented 6 years ago

So (correct me if I'm wrong!) since the Laplacian is real and symmetric, it is Hermitian. Hence the SVD is equivalent to the eigendecomposition: the singular values are the absolute values of the eigenvalues, and the singular vectors match the eigenvectors up to sign. Randomized SVD is only "close" to the exact SVD, though. Is this good enough?
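A quick sanity check of that equivalence on a small dense symmetric matrix (illustration only):

import numpy as np

rng = np.random.RandomState(0)
A = rng.randn(5, 5)
A = A + A.T  # real and symmetric, hence Hermitian

w, V = np.linalg.eigh(A)     # eigenvalues (ascending) and eigenvectors
U, s, Vt = np.linalg.svd(A)  # singular values (descending) and vectors

# singular values are the absolute eigenvalues
print(np.allclose(np.sort(s), np.sort(np.abs(w))))  # True
# each singular vector matches one eigenvector up to sign, so |U^T V|
# is a permutation matrix
P = np.abs(U.T @ V)
print(np.allclose(P @ P.T, np.eye(5)))  # True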

In terms of speed, randomized_svd is faster than eigsh, though not by a lot. This advantage disappears as sparsity increases.

>>> X
<1000x1000 sparse matrix of type '<class 'numpy.float64'>'
        with 181832 stored elements in Compressed Sparse Row format>
>>> (X != X.T).sum()
0
>>> %timeit -n 50 sklearn.utils.extmath.randomized_svd(X, 100)
50 loops, best of 3: 570 ms per loop
>>> %timeit -n 50 scipy.sparse.linalg.eigsh(X, 100)
50 loops, best of 3: 806 ms per loop
...
>>> X
<1000x1000 sparse matrix of type '<class 'numpy.float64'>'
        with 9210 stored elements in Compressed Sparse Row format>
>>> (X != X.T).sum()
0
>>> %timeit -n 50 sklearn.utils.extmath.randomized_svd(X, 100)
50 loops, best of 3: 319 ms per loop
>>> %timeit -n 50 scipy.sparse.linalg.eigsh(X, 100)
50 loops, best of 3: 342 ms per loop

In terms of interface, I would prefer compute_fourier_basis(n_eigenvectors=k). Default k=None (exact eigendecomposition), and if (k is None and self.U.shape != self.L.shape) or (k is not None and k != self.U.shape[1]), then automatically set recompute=True.
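Roughly this logic, as a standalone sketch (a hypothetical helper rather than the actual PyGSP method; it takes the Laplacian and the cached basis as arguments instead of self):

import numpy as np
from scipy.sparse.linalg import eigsh

def compute_fourier_basis(L, U=None, n_eigenvectors=None):
    """Hypothetical helper mirroring the proposed interface.

    L: sparse Laplacian, U: cached Fourier basis or None,
    n_eigenvectors=None means the full eigendecomposition.
    """
    N = L.shape[0]
    # recompute if nothing is cached or the cache does not match the request
    recompute = (
        U is None
        or (n_eigenvectors is None and U.shape[1] != N)
        or (n_eigenvectors is not None and U.shape[1] != n_eigenvectors)
    )
    if not recompute:
        return U
    if n_eigenvectors is None:
        _, U = np.linalg.eigh(L.toarray())             # full, dense
    else:
        _, U = eigsh(L, k=n_eigenvectors, which='SM')  # partial, smallest eigenvalues
    return U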

scottgigante commented 6 years ago

For reference, here is a measure of how approximate randomized_svd is. This can be tuned by adjusting n_iter, but given the difference in speed I don't think it would be worth increasing it.

>>> U, S, VT = scipy.sparse.linalg.svds(X, 100)
>>> F, D = scipy.sparse.linalg.eigsh(X, 100)
>>> U_hat, S_hat, VT_hat = sklearn.utils.extmath.randomized_svd(X, 100)
>>> 
>>> np.max(S), np.min(S)
(133.11136412575547, 73.47334348949771)
>>> np.max(np.sort(np.abs(F)) - np.sort(S))
9.947598300641403e-13
>>> np.max(np.sort(np.abs(F)) - np.sort(S_hat))
3.3946159566588676

I sort because the eigenvalues F are sorted from lowest to highest and include negative values, S is sorted lowest to highest, and S_hat is sorted highest to lowest.

nperraud commented 6 years ago

Here is my opinion.

mdeff commented 6 years ago

Thanks for your inputs @nperraud. Some comments:

Do you agree @scottgigante?

scottgigante commented 6 years ago

All sounds good to me! I'll submit a pull request shortly 👍

mdeff commented 6 years ago

Thanks @scottgigante :)