data-apis / array-api

RFC document, tooling and other content related to the array API standard
https://data-apis.github.io/array-api/latest/

APIs for discovering array structure as pertains to linear algebra operations #675

Closed: ilayn closed this issue 7 months ago

ilayn commented 1 year ago

I would like to move the recent discussion from SciPy https://github.com/scipy/scipy/issues/19068 to here for a more specialized treatment.

In the current incarnation of the linalg functions, the most important criterion for dispatching to different low-level routines is the array structure. Eigenvalues, SVD, linear solve, Cholesky, LU, QR, and so on all benefit from additional structure to gain performance and reduce the workload.

To that end, one typically implements a discovery function that is quite permissive, fast, and, more importantly, has quick-exit hatches; it performs an early O(n) investigation to detect non-finite entries and triangular, Hessenberg, or banded structure, and reports back to the dispatcher.

These types of algorithms typically go by the name of polyalgorithms and are becoming quite standard. The best-known is arguably MATLAB's backslash polyalgorithm, which selects the right low-level implementation automatically when the user writes A\b.
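To make the idea concrete, here is a minimal sketch of such a dispatcher in the spirit of backslash, built on existing scipy.linalg solvers (poly_solve is a hypothetical name, and the structure checks shown are the naive full-scan versions; a real polyalgorithm would use quick-exit checks as discussed below):

import numpy as np
from scipy.linalg import (LinAlgError, cho_factor, cho_solve,
                          lu_factor, lu_solve, solve_triangular)

def poly_solve(A, b):
    # Hypothetical backslash-style dispatcher: pick a specialized solver
    # based on cheaply discovered structure, falling back to a general LU.
    if np.array_equal(A, np.tril(A)):        # lower triangular
        return solve_triangular(A, b, lower=True)
    if np.array_equal(A, np.triu(A)):        # upper triangular
        return solve_triangular(A, b, lower=False)
    if np.array_equal(A, A.T):               # symmetric: try Cholesky first
        try:
            return cho_solve(cho_factor(A), b)
        except LinAlgError:                  # not positive definite
            pass
    return lu_solve(lu_factor(A), b)         # general dense fallback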

There is a stalled attempt in SciPy regarding this (mostly due to missing infrastructure and lack of maintainer time; see https://github.com/scipy/scipy/pull/12824 for more details). However, in this issue I'd like to point out that these discovery functions are missing from the array API, which is going to be relevant for any library implementing linalg functionality.

The naming can be changed, but the missing functionality is mainly:

scipy.linalg.bandwidth
scipy.linalg.ishermitian
scipy.linalg.issymmetric
scipy.linalg.isfinite  # does not exist yet

Most SciPy functions are crippled performance-wise due to the lack of the last item, since check_finite is a massive performance eater, especially for the sizes this array API is considering. Similarly, checking for all-zero arrays and other details can also be pushed to native low-level code for increased performance. The main discomfort we have as array consumers is that these checks typically don't quick-exit and instead consider all array entries, mostly through shortcuts like np.all(A == 0). This is pretty wasteful as the arrays get larger: it incurs an unconditional O(n**2) cost even when, quite probably, the very first entry is already nonzero.
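As an illustration of the difference, a sketch of a chunked quick-exit variant of the all-zeros check (the chunk size is arbitrary; a compiled implementation could exit per element):

import numpy as np

def is_all_zero(A, chunk=4096):
    # Unlike np.all(A == 0), stop scanning as soon as a chunk containing a
    # nonzero entry is found; for a typical dense array this returns after
    # inspecting roughly one chunk instead of all n**2 entries.
    flat = np.reshape(A, (-1,))
    for start in range(0, flat.shape[0], chunk):
        if np.any(flat[start:start + chunk] != 0):
            return False
    return True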

Hence, whether as low-level underscored functions or as public API, these are the functionalities we are hoping to build on in SciPy in the near future. Moreover, as we reduce our Fortran codebase, most of these will be carried over to compiled code for even lower overhead.

I'd like to get some feedback on what the array vendors think about this. If this came out of the box from the array vendors, that would in fact be much better. Since I did not see any work towards these issues, I wanted to bring the discussion here.

rgommers commented 1 year ago

Thanks for your thoughts @ilayn! There are two separate topics here: matrix structure, and check_finite performance/overhead. The former is the main topic of this issue, so let me first reply about the latter:

scipy.linalg.isfinite # does not exist yet

Most SciPy functions are crippled performance-wise due to the lack of the last item, since check_finite is a massive performance eater, especially for the sizes this array API is considering

There is an isfinite function already: https://data-apis.org/array-api/latest/API_specification/generated/array_api.isfinite.html. This function isn't linalg-specific. That can be used by SciPy, so I think we're good there. For the linalg extension in the array API standard, the one thing we should probably do is explicitly say that the behavior in the presence of inf/nan values is undefined, because right now I think the topic isn't explicitly covered at all in the text.
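For instance, a portable check along these lines could look like the following sketch (SciPy's current check goes through np.asarray_chkfinite instead):

def check_finite(x):
    # Works with any conforming library via the array's own namespace.
    xp = x.__array_namespace__()
    if not xp.all(xp.isfinite(x)):
        raise ValueError("array must not contain infs or NaNs")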

rgommers commented 1 year ago

Efficient structure discovery is interesting. The first question I have is how well-defined that topic is. The utilities are perhaps easy to implement in principle, e.g.:

def issymmetric(x):
    xp = x.__array_namespace__()  # get the array's array API namespace
    return xp.all(x == xp.matrix_transpose(x))

which I believe is all there is to it for exact symmetry. So, to make sure: is that always what you want here, or would one need some inexact notion of symmetry, where one chooses a symmetric solver routine if all elements are equal to within a certain tolerance? That's what is being done now in SciPy: https://github.com/scipy/scipy/blob/840a9ed0f41f84ce0bbf8c104bfbbbe93781cbdd/scipy/linalg/_cythonized_array_utils.pyx#L239-L323
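For reference, an inexact variant could be as simple as this sketch (tolerances would be caller-provided; note the allclose performance caveat discussed further down):

import numpy as np

def issymmetric_inexact(x, rtol=1e-12, atol=0.0):
    # Accept x as symmetric if each element matches its transposed
    # counterpart within tolerances; still a full scan with no early exit.
    return np.allclose(x, x.T, rtol=rtol, atol=atol)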

It seems a little tricky to standardize that, because I imagine it's more an internal routine than something that NumPy, PyTorch et al. would want to expose directly to their end users. If so, perhaps what is needed is more a library-agnostic implementation of this function? For your purposes, it's something SciPy should be able to easily vendor, and it should be an implementation that's guaranteed to work well with PyTorch, CuPy, JAX, et al. Correct?

ilayn commented 1 year ago

Indeed, if we had such per-vendor implementations of these operations, then SciPy probably wouldn't need all those home-baked Cythonized array utilities to shave off some microseconds. We could then rely on the arrays' own performant check routines, out of the box and in a unified fashion.

There is an isfinite function already:

Yes, but it returns an array, not a boolean saying "there is something non-finite in there". So for large arrays this is one of the issues.

ilayn commented 1 year ago

which I believe is all there is to it for exact symmetry.

Ah yes, this is one of the examples I have in mind. Imagine you have a 5000x5000 array whose [3, 2] element is not equal to the [2, 3] element. This test scans all 5000**2 entries even though it is obvious early on that it will return False. This is what I mean by quick-exit hatches. These checks occasionally cost as much time as the actual algorithm. That is why we started injecting custom compiled code into SciPy.

If we want to use less compiled code, the array API should provide similarly performant routines.
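A pure-Python sketch of the quick-exit idea (the SciPy routine linked above does this per element in Cython):

import numpy as np

def issymmetric_quick_exit(A):
    # Compare the lower triangle against the upper triangle row by row and
    # bail out at the first mismatch; a mismatch at [3, 2] is detected after
    # three short rows rather than after scanning all n**2 entries.
    n = A.shape[0]
    for i in range(1, n):
        if not np.array_equal(A[i, :i], A[:i, i]):
            return False
    return True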

rgommers commented 1 year ago

Yes, but it returns an array, not a boolean saying "there is something non-finite in there". So for large arrays this is one of the issues.

Ah, okay - got it. Basically, we're missing a fused all_isfinite or some such thing. In the regular case (all values are indeed finite) it would still have to traverse the whole array, but it would avoid the overhead of creating and then checking the intermediate boolean array. This is a useful thing to do; I'd like to give some thought to how best to do it.
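Pending such an API, a library-agnostic approximation could follow the same chunked pattern as the zero-check sketched earlier (all_isfinite is a hypothetical name; the chunking only coarsely approximates the per-element early exit a compiled kernel would get):

import numpy as np

def all_isfinite(x, chunk=65536):
    # Fused all(isfinite(x)): scan the flattened array in chunks so that a
    # non-finite value stops the scan early, and only chunk-sized boolean
    # intermediates are ever materialized.
    flat = np.reshape(x, (-1,))
    for start in range(0, flat.shape[0], chunk):
        if not np.all(np.isfinite(flat[start:start + chunk])):
            return False
    return True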

ilayn commented 1 year ago

Yes, exactly. Here is what we are doing currently:

https://github.com/scipy/scipy/blob/main/scipy/linalg/_cythonized_array_utils.pyx#L239-L365

rgommers commented 1 year ago

I had a closer look at this:

>>> x = np.zeros((100, 1000), dtype=np.float64)
>>> %timeit np.all(np.isfinite(x))
24.7 µs ± 92.8 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
>>> %timeit np.isfinite(x)
17.4 µs ± 164 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
>>> x[2, 2] = np.inf
>>> %timeit np.isfinite(x)
17.3 µs ± 90.3 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

So all(isfinite(x)) is fast, but does have a non-negligible cost. An all_isfinite could be faster than isfinite alone in case all values are finite (hard to check, but perhaps 50% faster?). That said, implementing it as, e.g., a loop in Cython would probably not be faster: this SIMD implementation in NumPy sped up isfinite by 2x-3x, which is likely more than can be gained by writing a Cython loop: https://github.com/numpy/numpy/pull/22165.

Use of allclose for structure checks is ~100x slower, because of the naive pure-Python implementation in NumPy:

>>> %timeit np.allclose(x, 0.0)
2.92 ms ± 7.55 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

My feeling is that a gain of ~2x for fused functions is a nice-to-have, but not low-hanging fruit. Things like np.allclose, which is used in the _cythonized_array_utils.pyx linked above, will be easier to speed up, and by a much larger factor.

ilayn commented 1 year ago

Ah, I should clarify: np.allclose is used only if the user wants an inexact comparison, that is, if the lower and upper triangles can be off by a little (hence the check for whether atol or rtol is provided). The exact test actually goes through the array with a double loop.

Nice comparison anyway. Here are my results for the code you provided above, just to give a sense of the timings on my machine:

>>> x = np.zeros((100, 1000), dtype=np.float64)
>>> %timeit np.all(np.isfinite(x))
17.9 µs ± 316 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
>>> %timeit np.isfinite(x)
13.9 µs ± 320 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
>>> x[2, 2] = np.inf
>>> %timeit np.isfinite(x)
13.7 µs ± 343 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

I am going to use the symmetry test, since we have already implemented that one in scipy.linalg:

>>> import scipy.linalg as la
>>> A = np.random.rand(100, 100) + 50*np.eye(100)
>>> A = A + A.T
>>> %timeit np.all(A == A.T)
11 µs ± 23.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
>>> %timeit la.issymmetric(A)
7.07 µs ± 18.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

OK, some benefit, but nothing to make noise about. Now I'll change the [1, 2] element:

>>> A[1, 2] = 0
>>> %timeit la.issymmetric(A)
3.36 µs ± 18.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
>>> %timeit np.all(A == A.T)
11 µs ± 22.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

So here, even though our SciPy implementation is not even close to cache-friendly, it still gets a >3x speedup, and that is mostly because we Cythonize the code without really using NumPy's C-level access speed, which would easily take this to 10x levels. The trick is that the test stops the moment it fails and does not carry on all the way to the end; in other words, one piece of evidence is enough. Similar things can be done for isfinite, nonzero structure, etc., with similar quick returns.

When linalg functions start checking their arguments, these small costs pile up and cost a lot of performance, even if we manage to circumvent f2py and link directly to OpenBLAS, etc.