Related topic: NumPy array protocols

rgommers commented 4 years ago

This issue is meant to summarize the current status and likely future direction of the NumPy array protocols, and their relevance to the array API standard.

What are these array protocols?

In summary, they are dispatching mechanisms that allow calling the public NumPy API with other numpy.ndarray-like arrays (e.g. CuPy or Dask arrays, or any other array that implements the protocols) and have the function call dispatch to that library. There are two protocols, __array_ufunc__ and __array_function__, that are very similar - the difference is that with __array_ufunc__ the library being dispatched to knows it's getting a ufunc and it can therefore make use of some properties all ufuncs have. The dispatching works the same for both protocols though.

Why were they created?

__array_ufunc__ was created first, the original driver was to be able to call numpy ufuncs on scipy.sparse matrices. __array_function__ was created later, to be able to cover most of the NumPy API (every function that takes an array as input) and use the NumPy API with other array/tensor implementations:

What is the current status?

The protocols have been adopted by:

CuPy
Dask
Xarray
MXNet
PyData Sparse
Pint

They have not (or not yet) been adopted by:

Tensorflow (because no compatible API to dispatch to, interest of maintainers unclear)
PyTorch (because no compatible API to dispatch to, maintainers do have interest)
JAX (concerns about added value and backwards compatibility - see NEP 37 introduction)
scipy.sparse (semantics not compatible)

The RAPIDS ecosystem, which builds on Dask and CuPy, has been particularly happy with these protocols, and use them heavily. There they've also run into some of the limitations, the most painful one being that array creation functions cannot be dispatched on.

What is likely to change in the near future?

There is still active exploration of new ideas and design alternatives (or additions to) the array protocols. There's 3 main "contenders":

extend the protocols to cover the most painful shortcomings: NEP 30 (__duckarray__) + NEP 35 (like=).
use a separate module namespace: NEP 37 (__array_module__)
use a multiple dispatch library: NEP 31 (unumpy)

At the moment, the most likely outcome is doing both (1) and (2). It needs prototyping and testing though - any solution should only be accepted when it's clear that it not only solves the immediate pain points RAPIDS ran into, but also that libraries like scikit-learn and SciPy can then adopt it.

What is the relationship of the array protocols with an API standard?

There's several connections:

The original idea of __array_function__ (figure above) doesn't require an API that's the same as the NumPy one, but in practice the protocols can only be adopted when there's an API with matching signatures and semantics.
The lack of an API standard has meant that it's hard to predict what NumPy functions will work for another array library that implements the protocols.
The separate namespaces (__array_module__, unumpy) provide a good opportunity to introduce a new API standard once that's agreed on.

References

oleksandr-pavlyk commented 4 years ago

Scikit-Learn developers are taking a look at NEP-37 vs. NEP-18: https://github.com/scikit-learn/scikit-learn/pull/17676

rgommers commented 4 years ago

Thanks @oleksandr-pavlyk. There's more discussion on that in https://github.com/scikit-learn/scikit-learn/pull/16574, and in the meeting minutes linked above (last link under References).

learning-chip commented 3 years ago

@rgommers Thanks for this useful summary. Do you know if SciPy has any discussions/plans similar to https://github.com/scikit-learn/scikit-learn/pull/16574 ?

There are several efforts reimplementing SciPy:

cupyx.scipy: https://docs.cupy.dev/en/stable/reference/scipy.html
jax.scipy: https://jax.readthedocs.io/en/latest/jax.scipy.html
dask.array: https://docs.dask.org/en/latest/array-api.html

A natural question is, given NEP-18/NEP-37, could I simply copy-and-paste their code, to provide SciPy API on top of other frameworks such as PyTorch/MXNet/TVM? My current answer is NO, due to the following reasons:

Both jax.scipy and cupyx.scipy do not use __array_function__ or __array_module__ for all operations. Cupyx uses import cupy namespace instead of import numpy as np; JAX uses from jax import lax backend to build fine-grained operations, such as jax.ops.index_update in jax.scipy.linalg.lu. So those modules are not portable to other frameworks. (ref #1)
The support of scipy sparse matrix is inconsistent between frameworks. JAX has no sparse matrix data type, and rely on a normal Python function to express scipy LinearOperator (in jax.scipy.sparse.linalg.cg). On the other hand, CuPy has cupyx.scipy.sparse that can utilize cuSPARSE.

Among all SciPy APIs, those submodule seems extremely useful:

Linear algebra (scipy.linalg)
Optimization and root finding (scipy.optimize)
Sparse linear algebra (scipy.sparse.linalg)
Integration and ODEs (scipy.integrate)

Those are also quite useful:

Discrete Fourier transforms (scipy.fft)
Interpolation (scipy.interpolate)
Statistical functions (scipy.stats)
Spatial algorithms and data structures (scipy.spatial)

If those APIs can be reimplemented in pure NumPy, and ported to other AI frameworks like MXNet/PyTorch (via NEP-37), we will automatically get the parallel version of all those solvers with end-to-end autodiff functionality. Sounds like a very big deal :)

rgommers commented 3 years ago

@rgommers Thanks for this useful summary. Do you know if SciPy has any discussions/plans similar to scikit-learn/scikit-learn#16574 ?

Yes, there are some plans. For submodules that are mostly compiled code, it makes sense that there are reimplementations, and it would be good if SciPy became aware of those. A discussion for ndimage, which is needed to make scikit-image work with Dask/CuPy, is in https://github.com/scipy/scipy/issues/10204. The same will apply to special, fft (already has a backend system), and linalg. Those are kind of the low-level building blocks.

For higher-level Python (and maybe Cython) code like in stats and optimize, an approach like NEP 37 (which is effectively superceded by this array API standard and its __array_namespace__ approach) makes more sense. Now that the array API implementation for NumPy is close to being merged, we will start experimenting with that for SciPy.

The support of scipy sparse matrix is inconsistent between frameworks.

Sparse is a bit of a special case. There are exactly zero sparse ndarray/tensor implementations that anyone is happy with. So thinking about dispatching/reuse of code, whether in scipy.sparse, PyData Sparse or elsewhere, is premature.

learning-chip commented 3 years ago

@rgommers Thanks for the replies. Regarding sparse matrix and its operations, I feel that PyTorch has decent support -- see pytorch_sparse and pytorch_scatter, which are used by the pytorch_geometric GNN library.

The question then is, how should existing sparse tensor backends map to numpy/scipy? Candidates include:

The official torch.sparse_coo_tensor
torch_sparse extension
MXNet ndarray.sparse
TVM tvm.contrib.sparse (https://github.com/apache/tvm/pull/1289)
MKL Sparse BLAS
cuSPARSE

etc...

rgommers commented 3 years ago

I feel that PyTorch has decent support

Not yet - a 2-D CSR implementation only just landed in PyTorch (will be released in 1.9.0) and that still lacks support in many operators. And its COO implementation is super slow. The third-party implementations are even more incomplete I believe.

The question then is, how should existing sparse tensor backends map to numpy/scipy?

What we need is a sparse ndarray with the exact same API as numpy.ndarray. And certainly not compatible with scipy.sparse matrices (which have np.matrix semantics with overloading * to mean @). That way sparsity becomes an implementation detail, and sparse arrays will duck-type with regular (dense) arrays. PyData Sparse has the right API for this. PyTorch will also enforce this naturally, because it's not a separate class, but a layout= attribute of torch.Tensor.

learning-chip commented 3 years ago

The third-party implementations are even more incomplete I believe.

This is exactly the reason why I bring up the third party torch_scatter -- it has Scatter, Segment COO, and Segment CSR operators that are reasonably fast on CPUs & GPUs, and are indeed better than the vanilla PyTorch right now.

learning-chip commented 3 years ago

PyData Sparse has the right API for this.

I totally agree that Pydata Sparse API is most consistent with numpy.ndarray semantics. However, from my 8+ years of numerical computing experience, 2-D sparse matrix handles almost all problems from HPC -- from PDE discretization to graph-like connectivity patterns.

N-D sparse tensor is indeed useful, but its use cases rarely come from scientific computing (the main use case of SciPy), but instead come from sparse neural networks in AI (see https://github.com/apache/tvm/issues/4332). For now, AI models rarely use truly sparse N-D tensors (N ≥ 3). Graph Neural Networks simply need a 2-D sparse matrix to express connectivity. Sparsified/pruned neural networks (similar to model compression & quantization) are often limited to block-sparse patterns, with 4x4/8x8 small dense blocks to match GPU's SIMT architecture. Sparse tensors are also incompatible with the systolic array architecture used by Google TPU and AWS Inferentia, furthur limiting their wide usage in AI.

Given the above facts, a natural question is: “is it worthwhile to fully support arbitrary-dimension sparse tensors, given their relatively limited usage?“ The most useful component in scipy.sparse, from my understanding, is the direct/iterative linear solvers in scipy.sparse.linalg (LU, CG, GMRES, etc.). Those solvers work perfectly with 2-D sparse matrix abstraction.

Maybe we just need to fix the annoying np.matrix semantics (caused by historical reasons), but don't really need to perfectly design & implement & optimize the indexing and broadcasting functionalities over N-D sparse tensors? The end goal is simply getting the parallel/distributed/accelerated/differentiable version of SciPy solvers, backed by popular AI frameworks.

leofang commented 3 years ago

However, from my 8+ years of numerical computing experience, 2-D sparse matrix handles almost all problems from HPC -- from PDE discretization to graph-like connectivity patterns.

N-D sparse tensor is indeed useful, but its use cases rarely come from scientific computing (the main use case of SciPy), but instead come from sparse neural networks in AI.

That does not align with my experience. At least from computational physics, having sparse tensors backing a quantum tensor network calculation is very useful. From the most simple matrix product state/tensor train to more sophisticated TN constructions, we can almost always make use of a rank > 2 sparse tensor representation at least for the most common short-range interactions.

Given the above facts, a natural question is: “is it worthwhile to fully support arbitrary-dimension sparse tensors, given their relatively limited usage?“

I'd say yes. There are emerging needs for this capacity.

The most useful component in scipy.sparse, from my understanding, is the direct/iterative linear solvers in scipy.sparse.linalg (LU, CG, GMRES, etc.). Those solvers work perfectly with 2-D sparse matrix abstraction.

Maybe we just need to fix the annoying np.matrix semantics (caused by historical reasons), but don't really need to perfectly design & implement & optimize the indexing and broadcasting functionalities over N-D sparse tensors?

Even taking a step back from tensor networks, having an ndarray backed by sparse tensors is useful to express batched SpMV and SpMM operations, for example, which in turn allow an efficient construction of the batched version of the solvers you mentioned. I have heard in several different occasions recently that people wish for such capability. If there is a sparse ndarray, it can help such needs very naturally. Of course even for GPUs there are quite some work to be done, but I think the investigation has already begun.

learning-chip commented 3 years ago

having sparse tensors backing a quantum tensor network calculation is very useful.

having an ndarray backed by sparse tensors is useful to express batched SpMV and SpMM operations, for example, which in turn allow an efficient construction of the batched version of the solvers you mentioned.

Those are all good points that I am not against to. To rephrase my opinion -- it is indeed necessary to support N-D sparse tensors in the next-generation NumPy/SciPy ecosystem -- the current PyData Sparse library and the sluggish torch.sparse_coo_tensor implementation already kind of support such data type. The question is, how much API design effort and performance optimization effort should we make, in order to build a good-enough N-D sparse tensor library that everyone likes?

Broadcasting over arbitrary-dimension sparse tensors leads to performance optimization difficulties, and there will always be people complaining about the performance of sparse.tensordot on GPUs. We even haven't solved the performance issue for the standard SpMV on CPU -- the AVX-512 FP Gather and Scatter instructions and the ARM SVE gather-load and scatter-store instructions are probably the only relevant architecture designs at this point (or you can also count the historical Cray Vector Machine and emerging RISC-V Vector extension...)

My argument is, given the wide usage of 2-D sparse matrix, won't it be a low-hanging fruit to first design a good-enough 2-D sparse matrix abstraction, to facilitate the development of distributed/accelerated/differentiable versions of commonly used solvers in HPC?

Batched SpMV and SpMM operations can be implemented by 2-D sparse matrix + 1-D extension (3-D in total), and don't necessarily need 100-dimension sparse tensors. taco::Tensor is a good reference for mixing sparse and dense dimensions.

learning-chip commented 3 years ago

Further speaking of HPC use cases, considering the two biggest domains in traditional HPC -- fluid mechanics (weather, climate, engineering) and molecular dynamics (chemistry, biology, combustion). Those physical problems are typically discretized into 3-D/higher dense arrays (to represent physical fields) and 2-D sparse matrices (to present linear operators / physical interactions).

For more complicated numerical algorithms like adaptive mesh refinement (AMR), more advanced data structures are needed -- such as KD-tree for nearest neighbor search. Having a perfect N-D sparse tensor library still can't solve this area of problems.

rgommers commented 3 years ago

Thanks for the thoughts @learning-chip and @leofang. I think there's something to say for both viewpoints, they don't really conflict. In terms of number of use cases: 2-D > batched 2-D & block-sparse > 3-D > n-D.

There's two related but separate issues here:

What are the desired semantics?
What is the priority for a given library/implementation?

(1) is easier: the semantics should match those of dense/arrays tensors, and that statement is independent of the dimensionality of use cases.

For (2), SciPy is 2-D only, PyTorch just added 2-D CSR (a lot faster than COO) and is moving to extensions like batched 2-D and block-sparse.

Maybe we just need to fix the annoying np.matrix semantics (caused by historical reasons), but don't really need to perfectly design & implement & optimize the indexing and broadcasting functionalities over N-D sparse tensors?

That would be great, but it's hard to do given backwards compat issues. It will require a new API. For other efforts, the main thing is to ensure they don't copy SciPy's design mistake. PyData Sparse is ambitious going for n-D straight away, we'll see if they can pull it off.

My argument is, given the wide usage of 2-D sparse matrix, won't it be a low-hanging fruit to first design a good-enough 2-D sparse matrix abstraction, to facilitate the development of distributed/accelerated/differentiable versions of commonly used solvers in HPC?

I'm not sure how much a change in SciPy would influence the API of PyTorch/TF/JAX/MXNet - I hope not at all.

learning-chip commented 3 years ago

In terms of number of use cases: 2-D > batched 2-D & block-sparse > 3-D > n-D.

the semantics should match those of dense/arrays tensors, and that statement is independent of the dimensionality of use cases.

All agreed!

It will require a new API. For other efforts, the main thing is to ensure they don't copy SciPy's design mistake.

This is the exact question I want to ask -- for someone designing a new SciPy-like solver suite, what should the new API like? What are the design mistakes we must avoid?

The current best reference for me is cupyx.scipy.sparse, in terms of API coverage and consistency with SciPy. But as you said, SciPy API is not the best design choice. In particularly, it lacks batched 2-D & block-sparse implementations.

From a compiler point of view, TACO SpMV is a very good reference for optimizing sparse operations. But how should it map to NumPy/SciPy frontend remains a question, and I hope to get some thoughts/advice here.

rgommers commented 3 years ago

From a compiler point of view, TACO SpMV is a very good reference for optimizing sparse operations. But how should it map to NumPy/SciPy frontend remains a question, and I hope to get some thoughts/advice here.

Isn't this just a @ b or matmul(a, b)?

The sparse-specific API should only be constructing the sparse arrays, after that it can follow this array API standard (or numpy, or pytorch, or ... - all equivalent here).

The current best reference for me is cupyx.scipy.sparse, in terms of API coverage and consistency with SciPy. But as you said, SciPy API is not the best design choice. In particularly, it lacks batched 2-D & block-sparse implementations.

This will be the same answer - constructing batched/block-sparse will need specific constructor functions, and then it should be able to use the dense API.

learning-chip commented 3 years ago

Isn't this just a @ b or matmul(a, b)?

Yes, for just SpMV. If the goal is just to build another CuPyx-like solver library (backed by one of PyTorch/MXNet/TVM for example), then changing a * b to a @ b (and copy the rest of SciPy API) is probably enough. If, however, the goal is to design an NEP47-like standard for SciPy/sparse -- to enable effortless switch between tensor backends for SciPy solvers, then there must be some missing standards to be determined. For example:

How many sparse formats (COO, CSR, BSR, LinearOpeartor, batched, etc.) should be supported, an how should they be converted to each other?
For a given solver API (e.g. splu, gmres, svd, eigvals), should it accept all of the sparse formats, or just at least one format?
The distributed storage of sparse matrix elements -- this can be left to implementation.

My questions right now:

Are there any discussions & efforts in the SciPy community about designing such a new API standard (similar to NEP-47)? SciPy has even more historical baggage than NumPy -- scipy.io & scipy.misc & scipy.fftpack are rarely used now, and scipy.odr and scipy.cluster should probably belong to sklearn. There are tons of rooms for API simplification.
If there are no formal discussions & efforts so far, would it be worthwhile to start one? Considering the large user base of SciPy, the need to accelerate serial solvers, and the current fragmented implementations (cupyx.scipy, jax.scipy, etc.). I saw Support for distributed arrays and GPU arrays in SciPy Roadmap, and wonder if there are any follow-ups.

Same question for accelerated sklearn. I will see if sklearn-onnx is the right solution.

rgommers commented 3 years ago

Are there any discussions & efforts in the SciPy community about designing such a new API standard (similar to NEP-47)?

Not about a standard as such. I think to consider that, we first need to show the array API standard being adopted and used in SciPy and other downstream libraries. It's also a much more fuzzy question - it may make sense for some parts of SciPy but not for the package as a whole.

For fft and linalg, they should be supersets of https://data-apis.org/array-api/latest/extensions/linear_algebra_functions.html and numpy.fft (which is also being added as an extension module at https://github.com/data-apis/array-api/pull/189).

For the most widely used functions in special it could make sense, and maybe ndimage and parts of optimize and interpolate. For everything else I think it's a bridge too far, at least in the near future.

SciPy has even more historical baggage than NumPy -- scipy.io & scipy.misc & scipy.fftpack are rarely used now, and scipy.odr and scipy.cluster should probably belong to sklearn. There are tons of rooms for API simplification.

I can't disagree with that:)

If there are no formal discussions & efforts so far, would it be worthwhile to start one? Considering the large user base of SciPy, the need to accelerate serial solvers, and the current fragmented implementations (cupyx.scipy, jax.scipy, etc.).

It would be nice to have a bit of coordination there perhaps. For now I think other libraries are copying the parts they need, which is fine I guess - but I'd rather not see functionality with poor APIs or questionable algorithms copied (looking at you, cusignal ....).

I saw Support for distributed arrays and GPU arrays in SciPy Roadmap, and wonder if there are any follow-ups.

I expect there to be some follow-up yes within the next months. We basically have all the pieces of the puzzle to start connecting things together - see, e.g., https://github.com/scipy/scipy/issues/10204 for how to connect scipy.ndimage, scikit-image, dask and cupy (there is some funding for that work, through the CZI scikit-image grant).

Same question for accelerated sklearn. I will see if sklearn-onnx is the right solution.

My feeling is that scikit-learn is in a much better place, because it has done an exceptional job with careful API design and creating compatible APIs in other packages can be and is being done. Maybe @amueller can say more about this.

learning-chip commented 3 years ago

I expect there to be some follow-up yes within the next months.

Thanks for the information -- what will be the right place to track the follow-ups on accelerating SciPy?

FYI, I will spend my time experimenting with tvm.topi.sparse and pytaco.tensor, to build the accelerated version of scipy.sparse.linalg. For further updates and discussions, should I open an issue here or on SciPy's GitHub issue?

learning-chip commented 3 years ago

FYI, PyTorch 1.9 (released today) reimplements numpy.linalg and scipy.special:

rgommers commented 3 years ago

Thanks for the information -- what will be the right place to track the follow-ups on accelerating SciPy?

The scipy-dev mailing list will be the best place. But I understand subscribing to the mailing list is not ideal. I'll try to remember to add an update to this repo as well.

FYI, I will spend my time experimenting with tvm.topi.sparse and pytaco.tensor, to build the accelerated version of scipy.sparse.linalg. For further updates and discussions, should I open an issue here or on SciPy's GitHub issue?

That sounds cool. For that topic I'd suggest opening a tracking issue on the SciPy repo.

data-apis / array-api

Related topic: NumPy array protocols #1