Open vnmabus opened 7 months ago
There are some difficulties that the NumPy folks are discussing: https://github.com/numpy/numpy/issues/24989
Well, maybe the standard should introduce a new name then, such as `topython`, so that people are not confused by the name.
The semantic issues with `tolist` are real, and I'm also not sure that this should be supported for n-D arrays. If there's a need for this, it'd be better to add the relevant dunder method so `list(x)` works; a separate function or method doesn't seem great.

Would one actually need this for arrays of arbitrary dimensionality? Some real-world examples would be good to see.
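For context on why a dunder alone may not be enough: with today's NumPy (used here only as an illustration of current behavior), `list(x)` on an n-D array iterates over the first axis and yields sub-arrays, not Python objects:

```python
import numpy as np

x = np.arange(6).reshape(2, 3)

# list() only iterates over the first axis, so the result is a list
# of 1-D sub-arrays, not a nested list of Python ints.
rows = list(x)
print(type(rows[0]))  # <class 'numpy.ndarray'>
print(x.tolist())     # [[0, 1, 2], [3, 4, 5]] -- fully converted
```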
I don't know which dunder method that would be. I think currently `list(x)` would return a list of 0D arrays for arrays that follow the standard (not the current NumPy `ndarray`), instead of a list of Python types. If you are interested in just a normal list, a possibility would be to offer an iterator over the elements of the array (like NumPy's `ndarray.flat`), so that you can do `list(x.flat)`. I am not sure if there are use cases for a multidimensional `tolist`.
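NumPy's existing `ndarray.flat` shows what such an element iterator would give (NumPy is just the example here; the standard itself does not define `flat`):

```python
import numpy as np

x = np.arange(6).reshape(2, 3)

# .flat iterates over every element in C order, regardless of the
# number of dimensions, so list() over it yields a flat list.
flat = list(x.flat)
print(len(flat))               # 6
print([int(v) for v in flat])  # [0, 1, 2, 3, 4, 5]
```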
My use case was for 0D arrays, more similar to NumPy's `ndarray.item()`. However, I thought that it was preferable to have one dimension-independent function to retrieve the Python objects (so, similar to how `tolist` behaves), rather than including just `item()`, only for it to become redundant if something like `tolist` is added later.
For 0-D arrays, `list(float(x))` should work already. Extend it a little if it needs to be generic over all dtypes, by checking with `isdtype` - that's not a bad thing, because it's not clear whether you'd want `uint*` → Python `int`.
I think there was a misunderstanding... I do not want a list returned for 0D arrays, but a dtype-independent way to convert them to a Python object that can hold them, that also works for non-standard dtypes.
What you need to use is `[float(x) for x in arr.ravel()]`, since iteration behavior is unspecified (assuming you know you want a Python `float`).
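With NumPy standing in for the array library (note that `ravel` itself is a NumPy method, not part of the standard; a portable spelling would be `xp.reshape(arr, (-1,))`), the pattern looks like:

```python
import numpy as np

arr = np.array([[1, 2], [3, 4]], dtype=np.int32)

# Flatten first, then convert each element explicitly -- this works
# the same for any dimensionality and makes the target Python type
# explicit, instead of relying on library-specific tolist semantics.
values = [float(v) for v in arr.ravel()]
print(values)  # [1.0, 2.0, 3.0, 4.0]
```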
NumPy has the `.tolist()`/`.item()` methods (also object casts, actually), which have a preference to convert to the corresponding Python type when deemed reasonable (not saying that what it deems "reasonable" is actually reasonable).
As I said on the NumPy issue, maybe raveling would be the more useful default behavior, although I am not sure... Losing dimensions is also surprising! But then I also think that iterating over all elements would generally be a nice thing for array objects (although I realize that would require user teaching and better ways to iterate over a single axis).
Whatever the solution, maybe a new name is fine, or maybe one should just keep the `tolist` name but make `raveled`/`flattened=True/False` compulsory (i.e. it is undefined if not passed, and the "minimal" implementation used for testing would raise).
This issue made me wonder about converting from one namespace to another, say from PyTorch to NumPy. This works:
```python
x = array_api_compat.torch.asarray([1, 2, 3])
array_api_compat.numpy.asarray(x)  # -> array([1, 2, 3])
```
The reason I was thinking about this was that it would be nice to have a consistent way of converting things. Of course, there is no `asarray` for normal Python, so this is more of a thought experiment.
We did discuss a possibility to standardize bringing data from any array object to Python. It would make sense to have a function that would transfer the content of an array into another type that exposes the Python buffer protocol. From there the content could be converted to NumPy, or passed to `xp.asarray` in another library.
> I think there was a misunderstanding... I do not want a list returned for 0D arrays, but a dtype-independent way to convert them to a Python object that can hold them, that also works for non-standard dtypes.
Non-standard dtypes may not have a pure Python equivalent, so that's clearly quite tricky. Things like `datetime` may work; for different precisions like `float128` it's hard to determine whether it's fine to downcast to 64-bit `float`s, etc. I don't think there should be a too-magical do-it-all function. The current issues with NumPy's `.tolist` show that that's a problem. For 0-D arrays, with the set of dtypes that you care about, it's easy enough to write something like:
```python
def convert_0D_arrays(x):
    if not x.ndim == 0:
        raise ValueError('...')
    if xp.isdtype(x.dtype, 'real floating'):
        return float(x)
    elif xp.isdtype(x.dtype, 'complex floating'):
        return complex(x)
    # etc.
```
Static typing is also easier outside of a magical do-it-all function, because you can add the overloads for the different return types.
> The reason I was thinking about this was that it would be nice to have a consistent way of converting things.
This can be done with `from_dlpack`, or with `asarray`.
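A minimal sketch of that pattern, using NumPy as both producer and consumer purely for illustration (in practice the input would come from another library that implements `__dlpack__`):

```python
import numpy as np

# from_dlpack consumes any object implementing the DLPack protocol,
# which allows a (typically zero-copy) hand-off between libraries.
x = np.asarray([1, 2, 3])
y = np.from_dlpack(x)

# asarray also accepts foreign array objects in most libraries.
z = np.asarray(x)

print(y.tolist())  # [1, 2, 3]
```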
> Non-standard dtypes may not have a pure Python equivalent, so that's clearly quite tricky. Things like `datetime` may work; for different precisions like `float128` it's hard to determine whether it's fine to downcast to 64-bit `float`s, etc. I don't think there should be a too-magical do-it-all function. The current issues with NumPy's `.tolist` show that that's a problem. For 0-D arrays, with the set of dtypes that you care about, it's easy enough to write something like:
>
> ```python
> def convert_0D_arrays(x):
>     if not x.ndim == 0:
>         raise ValueError('...')
>     if xp.isdtype(x.dtype, 'real floating'):
>         return float(x)
>     elif xp.isdtype(x.dtype, 'complex floating'):
>         return complex(x)
>     # etc.
> ```
The thing is that if the arrays implement additional dtypes apart from the standard ones (something allowed by the standard, as far as I know), this is not so easy to do from the user side. Consider for example the `datetime` extension you mentioned. There is no dunder equivalent like `__float__` for datetime, so `datetime(x)` will likely not work, and neither will `memoryview`. So, what can a user do to retrieve the object?
And note that although it is valid to say "this is not a problem for the standard, as it only requires numerical types with dunder methods", I still see value in standardizing at least the naming and interface of a function similar to NumPy's `item` that provides the "best" way to represent a scalar quantity as a Python object, as intended by the array developers. Depending on how it is standardized, it could even be the case that the value returned for, for example, a `float64` dtype is not a Python float. For example, a library that wraps standard-compatible arrays and adds physical units on top, presenting an API similar to the standard, could implement `item` as returning objects with units attached.
> I still see value in standardizing at least the naming and interface of a function similar to NumPy's `item` that provides the "best" way to represent a scalar quantity as a Python object
There's a reasonable amount of consensus, among both NumPy devs and devs from other array libraries, that NumPy's scalars were a design mistake. They add a large amount of complexity, and we'd remove them from NumPy if we could (but, backwards compat). So I don't think this is going to fly.
> Consider for example the `datetime` extension you mentioned. There is no dunder equivalent like `__float__` for datetime.
There is only one library that supports datetime dtypes, namely NumPy. So you can explicitly handle that case with a NumPy function.
Just thinking out loud... If we agree that a "0D list" is an ill-defined construct, perhaps we can at least have clean `.tolist()` semantics for unambiguous cases? From a purist perspective, in addition to always getting a return value of type `list`, it's also very good to try to preserve the dimension/shape of an array, to facilitate a correct round trip (`tolist` → `asarray`, or vice versa).
- `.tolist()` returns an N-nested Python list, with the lengths of the inner lists dictated by `.shape`:
  - `.shape = (2,)`, output = `[1, 2]`
  - `.shape = (2, 3)`, output = `[[1, 2, 3], [2, 3, 4]]`
- `.tolist()` must assume C order:
  - `.shape = (0,)`, output = `[]`
  - `.shape = (2, 0)`, output = `[[], []]`
  - `.shape = (2, 3, 0)`, output = `[[[], [], []], [[], [], []]]`
  - `.shape = (0, 2)`
  - `.shape = (2, 0, 4)`
  - `.shape = (0, 3, 4)`
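The zero-sized cases can be checked against NumPy's existing `.tolist()` (NumPy serving only as prior art here); note in particular that the round trip loses trailing shape information when a leading dimension is zero:

```python
import numpy as np

print(np.zeros((2, 0)).tolist())   # [[], []]
print(np.zeros((0, 2)).tolist())   # []

# The round trip through asarray cannot recover the trailing "2":
roundtrip = np.asarray(np.zeros((0, 2)).tolist())
print(roundtrip.shape)  # (0,)
```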
> (`int()`, `float()`, ... to get a Python scalar)

That seems reasonable @leofang. However, there are going to be other exceptions aside from 0-D arrays, because leaving array land isn't always possible. E.g., what about non-CPU devices, or detaching from an autograd graph?
In case this is implemented, I would rather have the natural behavior for 0D arrays: returning a non-list (either a Python representation of the scalar value itself, or the array unchanged), so that `array(0).tolist() == array([0]).tolist()[0]`. If the name `tolist` is considered problematic for something that does not necessarily return a list, I would change the name rather than raising an exception for a case where the natural behavior is obvious.
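NumPy's current behavior already satisfies this identity (shown here only as prior art, not as what the standard prescribes):

```python
import numpy as np

a = np.array(0)    # 0-D array
b = np.array([0])  # 1-D array

# A 0-D array's tolist() returns the scalar itself, not a list.
print(a.tolist())                   # 0
print(b.tolist())                   # [0]
print(a.tolist() == b.tolist()[0])  # True
```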
Adding to @betatim's

> This issue made me wonder about converting from one namespace to another. Say from PyTorch to NumPy. This works:
>
> ```python
> x = array_api_compat.torch.asarray([1, 2, 3])
> array_api_compat.numpy.asarray(x)  # -> array([1, 2, 3])
> ```
>
> The reason I was thinking about this was that it would be nice to have a consistent way of converting things. Of course, there is no `asarray` for normal Python, so this is more of a thought experiment.

and @oleksandr-pavlyk's

> We did discuss a possibility to standardize bringing data from any array object to Python. It would make sense to have a function that would transfer the content of an array into another type that exposes the Python buffer protocol. From there the content could be converted to NumPy, or passed to `xp.asarray` in another library.

remarks: should we use a separate issue that covers inter-namespace conversion specifically, rather than `tolist`?
I want to highlight a use case I have when trying to adapt code for Array API compliance. There is some code that can't compromise on numerical accuracy and absolutely requires at least float64 precision, but I could use an integrated GPU that supports at most float32 (e.g. using the `mps` or `xpu` backends with PyTorch) for everything else. For this I would have to transfer data from device to CPU, run the float64 compute, and transfer back to device. But `.to_device("cpu")` is not part of the standard, and some array libraries might not support it (like CuPy arrays), so I can't rely on it. `from_dlpack` does not support inter-device conversion, so it's not appropriate either.

For this use case, an intermediate object that enables inter-device and inter-namespace conversion surely would be practical.
`tolist` has been mentioned, but conversion to and from NumPy is also commonly supported:

- PyTorch has `Tensor.numpy` and `torch.from_numpy`
- CuPy has `cupy.asnumpy`, and `cupy.asarray` works with NumPy arrays
- JAX has `jax.numpy.array`, and `np.asarray` works with JAX arrays
- Dask's `from_array` supports NumPy inputs, and `np.asarray` works with Dask arrays
- TensorFlow has `Tensor.numpy` and `tf.convert_to_tensor()`
- MXNet has `NDArray.asnumpy`, and `array` supports NumPy inputs
- dpctl has `asnumpy`, and `dpctl.tensor.asarray` supports NumPy inputs

Wouldn't it be practical to add to the Array API a conversion to NumPy, e.g. `to_numpy` or `asnumpy`? (`from_numpy` doesn't seem as necessary, since `asarray` or `from_dlpack` commonly already work with NumPy inputs.)
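For illustration only, here is a hypothetical `as_numpy` helper sketching what such a standardized conversion could do today on top of existing protocols (the name and fallback order are my assumptions, not anything specified; NumPy input is used so the snippet is self-contained):

```python
import numpy as np

def as_numpy(x):
    # Hypothetical helper: try the DLPack protocol first, then fall
    # back to asarray (which covers the buffer protocol and sequences).
    try:
        return np.from_dlpack(x)
    except (TypeError, AttributeError, RuntimeError):
        return np.asarray(x)

print(as_numpy(np.arange(3)).tolist())  # [0, 1, 2]
```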
> But `.to_device("cpu")` is not part of the standard, and some array libraries might not support it (like CuPy arrays), so I can't rely on it.
Related discussion https://github.com/data-apis/array-api/issues/626
@fcharras thanks for your thoughts! I've copied your comment to gh-626, so we can keep that "to host" topic there, and keep this one for `.tolist`.
I think (correct me if I am mistaken) that currently the only way to convert an array object back to a Python representation is to call `float`, `int`, `bool`, etc. on 0D arrays. This requires that the user knows the appropriate function to call and does not offer any standard way to retrieve the underlying Python object when the library has additional dtypes, such as `object` in NumPy.

Moreover, as there is no `tolist` in the standard, it is also not possible to obtain a list representation of the array (from which the Python object could be retrieved).

I propose to add `tolist` to the standard, as defined in NumPy and PyTorch, to deal with these cases. Although the name is a bit misleading (because for 0D arrays there is no list at all), I think that prior art justifies reusing that name.
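A minimal sketch of the semantics being proposed, assuming NumPy-style `ndim`, `shape`, and `item()` (the recursive structure is my illustration, not a specification):

```python
import numpy as np

def tolist(x):
    # Recurse over the leading axis; a 0-D array yields the Python
    # scalar itself rather than a list, matching NumPy/PyTorch.
    if x.ndim == 0:
        return x.item()
    return [tolist(x[i, ...]) for i in range(x.shape[0])]

print(tolist(np.arange(6).reshape(2, 3)))  # [[0, 1, 2], [3, 4, 5]]
print(tolist(np.array(7)))                 # 7
```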