data-apis / array-api

RFC document, tooling and other content related to the array API standard
https://data-apis.github.io/array-api/latest/
MIT License

Iteration on 1-D arrays #818

Open asmeurer opened 6 days ago

asmeurer commented 6 days ago

Recently in array-api-strict, I accidentally disabled iteration on 1-D arrays. This broke a lot of code in SciPy. I've since reverted the change (array-api-strict disallows iteration on >1-D arrays but allows it for 1-D arrays).
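To make the behavior described above concrete, here is a minimal sketch of a wrapper that permits iteration only on 1-D arrays. The class `StrictArraySketch`, its nested-list storage, and the choice of `TypeError` are all assumptions made for illustration; this is not array-api-strict's actual implementation.

```python
class StrictArraySketch:
    """Hypothetical array wrapper; not array-api-strict's real code."""

    def __init__(self, data, shape):
        self._data = data    # nested Python lists stand in for real storage
        self.shape = shape

    @property
    def ndim(self):
        return len(self.shape)

    def __getitem__(self, index):
        return self._data[index]

    def __iter__(self):
        # Assumption: reject anything that is not exactly 1-D with a TypeError.
        if self.ndim != 1:
            raise TypeError("iteration is only allowed on 1-D arrays")
        return iter(self._data)


vec = StrictArraySketch([1, 2, 3], shape=(3,))
mat = StrictArraySketch([[1, 2], [3, 4]], shape=(2, 2))

print(list(vec))  # the 1-D case iterates over its elements

try:
    list(mat)     # the 2-D case is refused
except TypeError as exc:
    error_message = str(exc)
    print(error_message)
```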

There have been discussions in the past about not allowing iteration on arrays https://github.com/data-apis/array-api/issues/188. Disallowing it for higher dimensional arrays is probably fine, but it's unclear whether a library like array-api-strict should disallow it for 1-D arrays. The reason is that, technically speaking, an array object that implements only the methods defined in the standard would allow iteration on 1-D arrays. This is because by default, if __iter__ is not defined but __getitem__ is, Python defines iteration as a[0], a[1], etc.
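The Python fallback mentioned above can be seen with a small sketch. `OnlyGetitem` is a hypothetical 1-D container invented for this example; it defines `__getitem__` and nothing else related to iteration.

```python
class OnlyGetitem:
    """Hypothetical 1-D container with __getitem__ but no __iter__."""

    def __init__(self, values):
        self._values = list(values)

    def __getitem__(self, index):
        # list indexing raises IndexError past the end, which is what
        # stops Python's fallback iteration.
        return self._values[index]


a = OnlyGetitem([10, 20, 30])

# Python tries a[0], a[1], ... until IndexError is raised.
items = [item for item in a]
has_iter = hasattr(a, "__iter__")

print(items)
print(has_iter)
```

Note that this fallback only terminates because out-of-range indexing raises `IndexError`; an object whose `__getitem__` behaves differently for out-of-bounds indices would not iterate cleanly this way.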

Given how painful this can be for upstream code, I wonder if we should make it explicit in the standard that iteration is defined for 1-D arrays.

A possible counterargument is that the new unstack function can be used to iterate on an array of any dimension. unstack(x) is the same as iter(x) in the NumPy sense of iteration (it iterates the elements if x is 1-dimensional and along the first axis if it is n-dimensional).
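As a rough illustration of that equivalence, the sketch below compares NumPy's iteration with a stand-in for unstack. `unstack_sketch` is a hypothetical helper written for this example (built on `np.moveaxis`), not the standard's or any library's implementation.

```python
import numpy as np


def unstack_sketch(x, axis=0):
    """Hypothetical stand-in for unstack: a tuple of slices of x along axis."""
    # Move the requested axis to the front, then iterate along it.
    return tuple(np.moveaxis(x, axis, 0))


x = np.arange(6).reshape(2, 3)

by_unstack = unstack_sketch(x)  # slices along axis 0
by_iter = tuple(x)              # NumPy-style iteration, also along axis 0

same = all(np.array_equal(u, v) for u, v in zip(by_unstack, by_iter))
print(same)
```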

seberg commented 6 days ago

FWIW, from an ideal perspective, I still think arr.iter(axis) or .iteraxis() would be the best API. (Default could be 0 or None, or undefined here.)
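A sketch of what such an axis-aware iterator might look like, written as a free function over NumPy arrays. The name `iteraxis`, its signature, and the default of 0 are assumptions for illustration only; no such function exists in the standard.

```python
import numpy as np


def iteraxis(x, axis=0):
    """Hypothetical helper: yield the slices of x along the given axis."""
    if x.ndim == 0:
        raise TypeError("cannot iterate over a 0-D array")
    axis = axis % x.ndim  # normalize negative axes
    for i in range(x.shape[axis]):
        # Full slices for the leading axes, then an integer index at `axis`.
        index = (slice(None),) * axis + (i,)
        yield x[index]


x = np.arange(6).reshape(2, 3)

rows = list(iteraxis(x, axis=0))     # two slices of shape (3,)
columns = list(iteraxis(x, axis=1))  # three slices of shape (2,)
print(len(rows), len(columns))
```

This version relies only on `shape`, `ndim`, and indexing, so the iteration axis is always explicit rather than implied by the list-of-lists convention.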

Once you define __iter__, unstack really is just the tuple(arr) one-liner.

Remember that the other argument was e.g. sympy, which doesn't use the list-of-lists analogy and iterates over all elements in its Matrix. So the reason for that is that conceptually there are other choices, and those choices may actually be better were it not for the fact that most users are indoctrinated to the list-of-list view of things.

asmeurer commented 6 days ago

Well my question here is specifically about the 1-D case. I don't think there is any ambiguity in that case, and based on the scipy changes, it seems to be much more common.

seberg commented 6 days ago

Ah, so allow iteration on 1-D arrays. Not sure how important it is, but that makes sense to me. And as you said, just having __getitem__ makes Python already think it should be a sequence/iterable, I guess. So I don't really see a downside to it. The 1-D limitation may be a bit awkward in practice, so I'm not sure it is a big advantage to promise it works, but it is likely common enough.

mdhaber commented 6 days ago

Are there examples besides SymPy of linear indexing in Python? I know Matlab does it, too, but I wonder why these matrix-centered implementations should govern what the array API does, given that NumPy, CuPy, PyTorch, JAX, tensorflow, and dask.array seem to agree. Never mind if x[i, ...] is allowed for multidimensional arrays.

rgommers commented 6 days ago

I agree it would be useful to document whether 1-D iteration is supported, explicitly must raise, or is undefined. The most important data point is: do all libraries currently allow 1-D iteration? Would you be able to check @asmeurer?

rgommers commented 6 days ago

> for the fact that most users are indoctrinated to the list-of-list view of things.

For the record: I don't think this is true, and I don't know of data to prove or disprove it either way. There are a lot of users who will think about this as 2-D/3-D regular grids and can visualize it like that (also the case for me), which is much more intuitive for, for example, physicists than "list of lists".

seberg commented 5 days ago

> which is much more intuitive for, for example, physicists

N-D is intuitive, but the question is what you think when you see for x in arr, and I think that is the list-of-list style of iteration. And I have seen a lot of nested for loops over arrays even by users who work with NumPy quite a lot.

So yeah, it is intuitive for physicists. But I still think that when it comes down to it, even many of those who find N-D intuitive will probably reach for the list-of-list analogy when they see a for loop. (Rather than one where you might just iterate all elements because you see it as a collection of elements first, with an N-D structure second.)

asmeurer commented 5 days ago

PyTorch, jax.numpy, dask.array, and surprisingly even sparse all allow 1-D iteration. They all actually seem to just follow NumPy on n-D iteration (I didn't test CuPy but it's obviously the same as NumPy).

mdhaber commented 5 days ago

In case it matters, I tested TF last night and it also seems to follow NumPy on n-D iteration.

import tensorflow as tf
x = tf.constant([[1, 2, 3], [4, 5, 6]])
x[0]  # <tf.Tensor: shape=(3,), dtype=int32, numpy=array([1, 2, 3], dtype=int32)>
for row in x:
    print(row)
# tf.Tensor([1 2 3], shape=(3,), dtype=int32)
# tf.Tensor([4 5 6], shape=(3,), dtype=int32)

Same with xarray. I couldn't get Weld, Bohrium, Arkouda, or Legate to work on Colab, but ChatGPT tells me that in Weld a 2d array would be a vector of vectors, in Arkouda a 2d array would be a dictionary of 1d arrays, and Legate and Bohrium are supposed to be drop-in replacements for NumPy, so I would expect those to follow the same convention to the extent that multidimensional input is accepted. I didn't test MXNet since the project seems to have been retired.