data-apis / array-api

RFC document, tooling and other content related to the array API standard
https://data-apis.github.io/array-api/latest/
MIT License

[Consistency] Default behaviour of SVD vs QR #214

Closed lezcano closed 3 years ago

lezcano commented 3 years ago

Problem The current default behaviour of linalg.svd is to compute the full decomposition (full_matrices=True), whereas linalg.qr defaults to the reduced decomposition (mode="reduced").

Proposal Change the default to full_matrices=False in linalg.svd. In most cases the reduced SVD is good enough. Moreover, it's better to do less work by default and make the more expensive operation opt-in.
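For concreteness, here is a minimal NumPy sketch of the two defaults as they currently stand (shapes shown for a tall 6 x 3 matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 3))  # tall matrix: m=6, n=3

# SVD default: full_matrices=True, so U is square (m x m)
U, s, Vh = np.linalg.svd(A)
print(U.shape, s.shape, Vh.shape)  # (6, 6) (3,) (3, 3)

# The reduced SVD must be requested explicitly
U, s, Vh = np.linalg.svd(A, full_matrices=False)
print(U.shape, s.shape, Vh.shape)  # (6, 3) (3,) (3, 3)

# QR default: mode="reduced", so Q is thin (m x n)
Q, R = np.linalg.qr(A)
print(Q.shape, R.shape)  # (6, 3) (3, 3)
```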

cc @rgommers @mruberry

rgommers commented 3 years ago

Thanks for the proposal @Lezcano. It sounds sensible, but it would be good to get some confirmation that this is the right thing to do. One way to go about that would be to check some of the svd usages in SciPy. Over 50% of those do not use full_matrices at all, which likely means the authors just went with the default. If you pick, say, 3 or 5 examples from SciPy and show that they become more efficient with full_matrices=False without introducing algorithmic/accuracy issues, that would be clear evidence that changing to the more efficient default makes sense.

lezcano commented 3 years ago

That makes sense. I'll show a few examples here, in the order in which they appear when searching for linalg.svd in the SciPy repo, together with one from the svd tests. I will omit those where compute_uv=False is used, since in that case the full_matrices flag does nothing.

  1. test_svds.py L19. Called from a number of places in the file. It performs an SVD and then orders the singular values (and the singular vectors) increasingly or decreasingly. In every call within the file, the code makes sure that the k passed is less than min(m, n) for a matrix of size m x n. In other words, full_matrices=False should have been used here.

  2. Trust region methods. The SVD is used with full_matrices=False.

  3. Test multivariate normal. The SVD is used without full_matrices to generate a random orthogonal matrix. This is not a good use of svd; it should be replaced by a qr. The reason is that it is possible to prove that if a matrix has entries distributed as a standard normal, the Q of its qr decomposition is distributed according to the Haar measure on the orthogonal group. Also, qr is much faster than svd. All this is implemented carefully and efficiently in stats.ortho_group (see the sketch after the summary below).

  4. K-means clustering. The SVD is used with full_matrices=False.

  5. test_basic.py. It is used without parameters but on a square matrix. As such, the behaviour with full_matrices=False would be the same.

  6. rotation.py. It is used without parameters but on a square matrix. As such, the behaviour with full_matrices=False would be the same.

Summary: These examples show that, in every case, using full_matrices=False is either beneficial, a no-op, or reveals a problem with the existing code. The most problematic point is the third one: one might find examples in which the user did intend to use full_matrices=True (for correct reasons or not), and changing the default would break that code. That being said, it is clear that the full_matrices=False default is the saner one from an API perspective.
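Regarding point 3, here is a minimal sketch of the QR-based construction, following Mezzadri's recipe (which, as far as I know, is roughly what stats.ortho_group implements); the helper name is mine:

```python
import numpy as np

def random_orthogonal(n, seed=None):
    """Draw an n x n orthogonal matrix from the Haar measure via QR."""
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal((n, n))  # i.i.d. standard normal entries
    Q, R = np.linalg.qr(Z)
    # Rescale the columns so that R has a positive diagonal; without this
    # sign correction, Q is not Haar-distributed.
    return Q * np.sign(np.diag(R))

Q = random_orthogonal(4, seed=0)
print(np.allclose(Q.T @ Q, np.eye(4)))  # True
```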

rgommers commented 3 years ago

Thanks @Lezcano! That's convincing to me, +1 for changing the default. And I guess it's time to open a SciPy PR :)

ilayn commented 3 years ago

This is not accurate: the mistake is that the 'reduced' default only applies to numpy.linalg.qr.

Both scipy.linalg.qr and scipy.linalg.svd return the full output by default, so there is no inconsistency there. This is in line with dense usage, since most use cases require the full output rather than the economy output.

shoyer commented 3 years ago

I agree, when I've used svd I cannot think of a case where I did not set full_matrices=False.

kgryte commented 3 years ago

@ilayn Can you elaborate a bit more on "most of the use cases require full output"? Which use cases do you have in mind?

ilayn commented 3 years ago

Like I mentioned, almost all of them. From signal processing to control theory, system identification, and more esoteric matrix decompositions, we always use full matrices. Other than low-rank approximations and least-squares problems, I don't encounter the economy mode of svd, since a thin svd is often comparable to a QR for changing the basis. A compact SVD is also a low-rank approximation, but you can use iterative methods with better performance for matrix sizes where it would make any difference.

For reduced QR it's a mixed bag: depending on the matrix shape, either form can be useful, but that doesn't grant a ticket for changing the default to thin qr. Julia stopped defaulting to the economy mode around 0.7, I think; matlab also doesn't return thin or reduced output by default for SVD or QR.

The same is the case for Mathematica's SVD, but it returns the thin output for QR; I haven't checked in years, though. Thus the only two I know of that don't fit the rest are numpy's qr and Mathematica's qr, which is strange. I wasn't around then, so I don't know why that default was selected.

lezcano commented 3 years ago

I agree with @shoyer. I have never had the need to use full_matrices=True.

In the SciPy codebase, I found just one use of full_matrices=True that was needed, while there were plenty of uses of full_matrices=False.

In the field that I know best, that of matrix analysis and geometry, the thin SVD and reduced QR are often used to reduce certain computations to smaller matrices. For example, when you compute geodesics on the manifold of matrices with orthonormal columns (the Stiefel manifold) St(n, p) with n > p, you can do so by first computing a thin QR decomposition, which shrinks the computation of a matrix exponential from a matrix of size 2n x 2n to one of size 2p x 2p (see eqs. 2.45-2.47 in the paper). The same holds for the Grassmannian via a compact SVD (see Theorem 2.3). The same article shows how to implement a form of conjugate gradient descent, also via the thin SVD.

In a similar way, a reduced SVD may be used to compute the polar decomposition of a matrix. More generally, it allows computing the projection of a matrix onto the Stiefel manifold, or low-rank approximations (as @ilayn mentioned) via Eckart–Young–Mirsky. While iterative methods based on Lanczos can be more efficient at times, this is not always the case. For example, on GPUs, iterative methods are incredibly slow, but rather fast GPU implementations of the SVD exist. Given that a number of the frameworks that will hopefully follow the array API support GPUs, this is certainly something to keep in mind.
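As an illustration, a minimal sketch of the projection onto the Stiefel manifold via the thin SVD (equivalently, the orthogonal factor of the polar decomposition); the helper name is mine:

```python
import numpy as np

def stiefel_projection(A):
    """Closest matrix with orthonormal columns to A (n x p, n >= p) in the
    Frobenius norm. If A = U @ diag(s) @ Vh is the thin SVD, the projection
    (the orthogonal polar factor) is U @ Vh."""
    U, s, Vh = np.linalg.svd(A, full_matrices=False)  # U: (n, p), Vh: (p, p)
    return U @ Vh

A = np.random.default_rng(0).standard_normal((5, 2))
P = stiefel_projection(A)
print(np.allclose(P.T @ P, np.eye(2)))  # True: columns are orthonormal
```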

In short, I have encountered the thin SVD in many different applications as a building block for other algorithms, and in all the applications I have come across, the thin SVD was the one used. That being said, my experience is certainly biased towards the fields I have studied. As such, it would be helpful if other people could share examples in which the full SVD is useful, as I do not know of any in my field, and there was just one in the SciPy code base.

leofang commented 3 years ago

I agree, when I've used svd I cannot think of a case where I did not set full_matrices=False.

I agree with @shoyer. I have never had the need to use full_matrices=False.

Pardon my poor English, but aren't these two statements contradictory? 😂

lezcano commented 3 years ago

Edited :)

kgryte commented 3 years ago

Based on consortium meeting minutes from 2021-07-29, it was decided to keep the current default of full_matrices=True for SVD, given the lack of consensus regarding whether full output is more generally preferred. Since this retains the status quo, users will have to opt in to return a reduced SVD. This is similar to other environments (e.g., MATLAB).

nikitaved commented 2 years ago

My 2 cents on this. If your input is super wide/tall, then full_matrices=True is inefficient from the memory point of view. If, further, a backprop is required, only subspaces of size min(m, n) are used for gradient computations. But since these subspaces are views into those potentially huge matrices, the full bases have to be stored for both the forward and backward passes. I personally find the reduced variant cleaner, because we get orthonormal bases for the row and column spaces without needing to narrow them down if, for example, we would like to get a projection onto the range/complement subspaces.
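To make the memory point concrete, here is a small NumPy sketch (the sizes are only illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((2000, 5))  # tall matrix: m=2000, n=5

U_full, s, Vh = np.linalg.svd(A)                       # default: full_matrices=True
U_thin, s, Vh = np.linalg.svd(A, full_matrices=False)  # reduced SVD

print(U_full.shape, f"{U_full.nbytes / 1e6:.1f} MB")  # (2000, 2000) 32.0 MB
print(U_thin.shape, f"{U_thin.nbytes / 1e3:.1f} kB")  # (2000, 5) 80.0 kB
```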

jakirkham commented 2 years ago

Yeah, we don't even have full_matrices=True as an option in Dask currently; it is implemented as full_matrices=False. Though we do have an open issue about adding it (https://github.com/dask/dask/issues/3576).