Thanks for your thoughts @ilayn! I think there are two separate topics here: matrix structure, and `check_finite` performance/overhead. The former is the main topic here, but let me first reply about the latter:
> `scipy.linalg.isfinite` # does not exist yet
>
> Most SciPy functions are crippled in terms of performance due to the lack of the last item, since `check_finite` is a massive performance eater, especially for the sizes this array API is considering.
There is an `isfinite` function already: https://data-apis.org/array-api/latest/API_specification/generated/array_api.isfinite.html. This function isn't `linalg`-specific. That can be used by SciPy, so I think we're good there. For the `linalg` extension in the array API standard, the one thing we should probably do is explicitly say that the behavior in the presence of `inf`/`nan` values is undefined, because right now I think the topic isn't explicitly covered at all in the text.
Efficient structure discovery is interesting. The first question I have is how well-defined that topic is. The utilities are easy to implement in principle perhaps, e.g.:
def issymmetric(x):
    # `all` and `matrix_transpose` as in the array API standard namespace
    return all(x == matrix_transpose(x))
which I believe is all there is to it for exact symmetry. So to make sure: is that always what you want here, or would one need some inexact approximation to symmetry where if all elements are equal to within a certain tolerance, one chooses a symmetric solver routine? That's what is being done now in SciPy: https://github.com/scipy/scipy/blob/840a9ed0f41f84ce0bbf8c104bfbbbe93781cbdd/scipy/linalg/_cythonized_array_utils.pyx#L239-L323
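If the inexact variant is indeed what is needed, a minimal sketch of a tolerance-based check could look like this (the `rtol`/`atol` defaults are placeholders, not SciPy's actual defaults):

```python
import numpy as np

def issymmetric_approx(x, rtol=1e-12, atol=0.0):
    # Tolerance-based symmetry check: every element must match its transposed
    # counterpart to within atol + rtol * |counterpart|.
    xt = np.swapaxes(x, -2, -1)
    return bool(np.all(np.abs(x - xt) <= atol + rtol * np.abs(xt)))
```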
It seems a little tricky to standardize that, because I imagine it's more an internal routine than something that NumPy, PyTorch et al. would want to expose directly to their end users. If so, perhaps what is needed is more a library-agnostic implementation of this function? For your purposes, it's something SciPy should be able to easily vendor, and it should be an implementation that's guaranteed to work well with PyTorch, CuPy, JAX, et al. Correct?
Indeed, if we have these per-vendor implementations for such operations, then SciPy probably won't need all those home-baked Cythonized array utilities to shave off some microseconds. Then we can rely on the arrays' own performant check routines out of the box, in a unified fashion.
> There is an `isfinite` function already:

Yes, but it returns an array, not a boolean saying "there is something in there that is not finite". So for large arrays this is one of the issues.
> which I believe is all there is to it for exact symmetry.
Ah yes, this is one of the examples I have. So imagine you have a 5000x5000 array and the [3, 2] element is not equal to the [2, 3] element. This test still scans all 5000**2 entries, even though it is obvious that it will return False in the end. Hence this is what I mean by quick-exit hatches. These checks occasionally cost as much time as the actual algorithm. So we start injecting custom compiled code into SciPy.
If we want to use less compiled code, the array API should actually provide similarly performant routines.
> Yes, but it returns an array, not a boolean saying "there is something in there that is not finite". So for large arrays this is one of the issues.
Ah, okay - got it. Basically, we're missing a fused `all_isfinite` or some such thing. In the regular case (all values are indeed finite), this will then still have to traverse the whole array, but it would avoid the overhead of creating and then checking the intermediate boolean array. This is a useful thing to do; I'd like to give some thought to how to best do it.
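A rough sketch of what a naive chunked version of such a fused check could look like (the name `all_isfinite` and the chunk size are made up; a real implementation would live in compiled code and exit on the first non-finite value):

```python
import numpy as np

def all_isfinite(x, chunk=1 << 16):
    # Scan the array in fixed-size chunks: no full-size boolean intermediate
    # is materialized, and a non-finite value triggers an early exit.
    flat = np.ravel(x)
    for start in range(0, flat.size, chunk):
        if not np.isfinite(flat[start:start + chunk]).all():
            return False
    return True
```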
Yes, exactly - here is what we are doing currently:
https://github.com/scipy/scipy/blob/main/scipy/linalg/_cythonized_array_utils.pyx#L239-L365
I had a closer look at this:
>>> x = np.zeros((100, 1000), dtype=np.float64)
>>> %timeit np.all(np.isfinite(x))
24.7 µs ± 92.8 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
>>> %timeit np.isfinite(x)
17.4 µs ± 164 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
>>> x[2, 2] = np.inf
>>> %timeit np.isfinite(x)
17.3 µs ± 90.3 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
So `all(isfinite(x))` is fast, but does have a non-negligible cost. An `all_isfinite` could be faster than `isfinite` alone in case all values are finite (hard to check, but perhaps 50% faster?). That said, implementing it as, e.g., a loop in Cython code would probably not be faster - this SIMD implementation in NumPy sped up `isfinite` by 2x-3x, which is likely more than can be gained by writing a Cython loop I think: https://github.com/numpy/numpy/pull/22165.
Use of `allclose` for structure checks is ~100x slower, because of the naive pure Python implementation in NumPy:
>>> %timeit np.allclose(x, 0.0)
2.92 ms ± 7.55 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
My feeling is that a gain of ~2x for fused functions is a nice-to-have, but not low-hanging fruit. Things like `np.allclose`, which is used in the `_cythonized_array_utils.pyx` linked to above, will be easier to speed up, and by a much larger factor.
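For the specific `allclose(x, 0.0)` case timed above, a much cheaper check that stays entirely in C-level ufuncs is possible; a minimal sketch (the `rtol` term drops out because the reference value is zero, and the function name is mine):

```python
import numpy as np

def allclose_to_zero(x, atol=1e-8):
    # Same decision as np.allclose(x, 0.0, atol=atol), but without the
    # pure-Python overhead of the allclose wrapper.
    return bool(np.all(np.abs(x) <= atol))
```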
Ah, I should clarify: `np.allclose` is used only if the user wants an inexact comparison, that is, if things can be off a little bit in the lower and upper parts (hence the check for whether `atol` or `rtol` is provided). In the exact tests it actually goes through the array with a double loop.
Nice comparison anyway - here are my results for the code you provided at the top, just to give a sense of the timings on my machine:
>>> x = np.zeros((100, 1000), dtype=np.float64)
>>> %timeit np.all(np.isfinite(x))
17.9 µs ± 316 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
>>> %timeit np.isfinite(x)
13.9 µs ± 320 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
>>> x[2, 2] = np.inf
>>> %timeit np.isfinite(x)
13.7 µs ± 343 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
I am going to use the symmetry test, since we have already implemented that in `scipy.linalg`:
>>> A = np.random.rand(100, 100) + 50*np.eye(100)
>>> A = A + A.T
>>> %timeit np.all(A == A.T)
11 µs ± 23.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
>>> %timeit la.issymmetric(A)
7.07 µs ± 18.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
OK, some benefit, nothing to make noise about. But now I'll change the [1, 2] element:
>>> A[1, 2] = 0
>>> %timeit la.issymmetric(A)
3.36 µs ± 18.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
>>> %timeit np.all(A == A.T)
11 µs ± 22.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
So here, even though our SciPy implementation is not even close to being cache-friendly, it still gets more than a 3x speedup, and that is mostly because we Cythonize the code without really using NumPy's C-level access speed, which would easily take this to 10x levels. The trick is that it stops the test the moment it fails and does not carry on all the way to the end; in other words, one counterexample is enough. Similar things can be done with `isfinite`, nonzero structure, etc., with similar quick returns.
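For illustration, the quick-exit control flow amounts to something like the following (pure Python here just to show the idea; the SciPy version runs this double loop in Cython over the raw memory):

```python
import numpy as np

def issymmetric_quick_exit(a):
    # Compare the upper triangle against the lower triangle and bail out at
    # the first mismatch instead of scanning all n**2 entries.
    n = a.shape[0]
    for i in range(n):
        for j in range(i + 1, n):
            if a[i, j] != a[j, i]:
                return False
    return True
```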
When linalg functions start checking their arguments, these small costs start to pile up and cost a lot of performance, even if we manage to circumvent f2py and link directly to OpenBLAS etc.
I would like to move the recent discussion from SciPy https://github.com/scipy/scipy/issues/19068 to here for a more specialized treatment.
In the current incarnation of the linalg functions, the most important distinction for dispatching to different low-level routines is the array structure. Eigenvalues, SVD, linear solve, Cholesky, LU, QR and so on all benefit from additional structure to gain performance and reduce the workload.
To that end, typically a function that is quite permissive, fast, and, more importantly, equipped with quick-exit hatches is implemented to do an early O(n) investigation that discovers non-finite entries and triangular, Hessenberg, or banded structure, and reports back to the dispatcher.
These types of algorithms typically go by the name of polyalgorithms and are becoming quite standard. The most well-known is arguably the backslash polyalgorithm of MATLAB, which selects the right low-level implementation automatically when the user writes `A\b`. There is a stalled (mostly due to missing infrastructure and lack of maintainer time) attempt in SciPy regarding this (see https://github.com/scipy/scipy/pull/12824 for more details). However, in this issue I'd like to point out that these discovery functions are missing from the array API, which is going to be relevant for any library that is going to implement linalg functionality.
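For context, a backslash-style dispatcher built on such discovery functions could look roughly like the sketch below (the dispatch rules are illustrative only; `bandwidth` and `issymmetric` are the existing SciPy utilities discussed here):

```python
from scipy import linalg as la

def solve_poly(a, b):
    # Polyalgorithm sketch: cheap structure discovery (ideally with quick
    # exits) first, then dispatch to a specialized low-level solver.
    lower_bw, upper_bw = la.bandwidth(a)
    if lower_bw == 0 or upper_bw == 0:
        # triangular (or diagonal): forward/back substitution is enough
        return la.solve_triangular(a, b, lower=(upper_bw == 0))
    if la.issymmetric(a):
        # symmetric: let the solver use a symmetric factorization
        return la.solve(a, b, assume_a="sym")
    # general case: LU with partial pivoting
    return la.solve(a, b)
```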
The naming can be changed, but the missing functionality is mainly:

- `scipy.linalg.issymmetric`
- `scipy.linalg.ishermitian`
- `scipy.linalg.bandwidth`
- `scipy.linalg.isfinite` # does not exist yet
Most SciPy functions are crippled in terms of performance due to the lack of the last item, since `check_finite` is a massive performance eater, especially for the sizes this array API is considering. Similarly, checking for all zeros and other details can also be pushed to native low-level code for increased performance. The main discomfort we have as array consumers is that these functions typically don't quick-exit and consider all array entries, mostly using `np.all(A == 0)`-type shortcuts. This is pretty wasteful as the arrays get larger and gives an unconditional O(n**2) complexity even though probably already the first entry is nonzero. Hence, whether as low-level underscored functions or exposed in the public API, these are the functionalities we are hoping to get at in SciPy in the near future. Moreover, as we reduce our Fortran codebase, most of these will be carried over to compiled code for even more reduced overhead.
I'd like to get some feedback about what the array vendors think about this. In fact, if this came out of the box from the array vendors, that would be much better. Since I did not see any work towards these issues, I wanted to bring the discussion here.