Open vnmabus opened 7 months ago
There are some difficulties that the NumPy folks are discussing: https://github.com/numpy/numpy/issues/24989
Well, maybe the standard should introduce a new name then, such as `topython`, so that people are not confused by the name.
The semantic issues with `tolist` are real, and I'm also not sure that this should be supported for n-D arrays. If there's a need for this, it'd be better to add the relevant dunder method so `list(x)` works; a separate function or method doesn't seem great.

Would one actually need this for arrays of arbitrary dimensionality? Some real-world examples would be good to see.
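For context on why a dunder alone may not be enough: with today's NumPy (used here only as an illustration of current behavior), `list(x)` on an n-D array iterates over the first axis and yields sub-arrays, not Python objects:

```python
import numpy as np

x = np.arange(6).reshape(2, 3)

# list() only iterates over the first axis, so the result is a list
# of 1-D sub-arrays, not a nested list of Python ints.
rows = list(x)
print(type(rows[0]))  # <class 'numpy.ndarray'>
print(x.tolist())     # [[0, 1, 2], [3, 4, 5]] -- fully converted
```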
I don't know which dunder method that would be. I think currently `list(x)` would return a list of 0D arrays for arrays that follow the standard (not the current NumPy `ndarray`), instead of a list of Python types. If you are interested in just a normal list, a possibility would be to offer an iterator over the elements of the array (like NumPy's `ndarray.flat`), so that you can do `list(x.flat)`. I am not sure if there are use cases for a multidimensional `tolist`.
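NumPy's existing `ndarray.flat` shows what such an element iterator would give (NumPy is just the example here; the standard itself does not define `flat`):

```python
import numpy as np

x = np.arange(6).reshape(2, 3)

# .flat iterates over every element in C order, regardless of the
# number of dimensions, so list() over it yields a flat list.
flat = list(x.flat)
print(len(flat))               # 6
print([int(v) for v in flat])  # [0, 1, 2, 3, 4, 5]
```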
My use case was for 0D arrays, more similar to NumPy's `ndarray.item()`. However, I thought that it was preferable to have one dimension-independent function to retrieve the Python objects (so, similar to how `tolist` behaves), rather than including just `item()`, only for it to become redundant if something like `tolist` is added later.
For 0-D arrays, `list(float(x))` should work already. Extend it a little if it needs to be generic over all dtypes, by checking with `isdtype` - that's not a bad thing, because it's not clear whether you'd want `uint*` → Python `int`.
I think there was a misunderstanding... I do not want a list returned for 0D arrays, but a dtype-independent way to convert them to a Python object that can hold them, that also works for non-standard dtypes.
What you need to use is `[float(x) for x in arr.ravel()]`, since iteration behavior is unspecified (assuming you know you want a Python `float`).
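With NumPy standing in for the array library (note that `ravel` itself is a NumPy method, not part of the standard; a portable spelling would be `xp.reshape(arr, (-1,))`), the pattern looks like:

```python
import numpy as np

arr = np.array([[1, 2], [3, 4]], dtype=np.int32)

# Flatten first, then convert each element explicitly -- this works
# the same for any dimensionality and makes the target Python type
# explicit, instead of relying on library-specific tolist semantics.
values = [float(v) for v in arr.ravel()]
print(values)  # [1.0, 2.0, 3.0, 4.0]
```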
NumPy has the `.tolist()`/`.item()` methods (also object casts, actually), which have a preference to convert to the corresponding Python type when deemed reasonable (not saying that what it deems "reasonable" is actually reasonable).
As I said on the NumPy issue, maybe raveling would be the more useful default behavior, although I am not sure... Losing dimensions is also surprising! But then I also think that iterating over all elements would generally be a nice thing for array objects (although I realize that would require user teaching and better ways to iterate over a single axis).
Whatever the solution, maybe a new name is fine, or maybe one should just keep the `tolist` name but make `raveled`/`flattened=True/False` compulsory (i.e. it is undefined if not passed, and the "minimal" implementation used for testing would raise).
This issue made me wonder about converting from one namespace to another, say from PyTorch to NumPy. This works:
```python
x = array_api_compat.torch.asarray([1, 2, 3])
array_api_compat.numpy.asarray(x)  # -> array([1, 2, 3])
```
The reason I was thinking about this was that it would be nice to have a consistent way of converting things. Of course, there is no `asarray` for normal Python, so this is more of a thought experiment.
We did discuss a possibility to standardize bringing data from any array object to Python. It would make sense to have a function that would transfer the content of an array into another type that exposes the Python buffer protocol. From there the content could be converted to NumPy, or passed to `xp.asarray` in another library.
> I think there was a misunderstanding... I do not want a list returned for 0D arrays, but a dtype-independent way to convert them to a Python object that can hold them, that also works for non-standard dtypes.
Non-standard dtypes may not have a pure Python equivalent, so that's clearly quite tricky. Things like `datetime` may work; for different precisions like `float128` it's hard to determine whether it's fine to downcast to 64-bit `float`s, etc. I don't think there should be a too-magical do-it-all function. The current issues with NumPy's `.tolist` show that that's a problem. For 0-D arrays, with the set of dtypes that you care about, it's easy enough to write something like:
```python
def convert_0D_arrays(x):
    if not x.ndim == 0:
        raise ValueError('...')
    if xp.isdtype(x.dtype, 'real floating'):
        return float(x)
    elif xp.isdtype(x.dtype, 'complex floating'):
        return complex(x)
    # etc.
```
Static typing is also easier outside of a magical do-it-all function, because you can add the overloads for the different return types.
> The reason I was thinking about this was that it would be nice to have a consistent way of converting things.
This can be done with `from_dlpack`, or with `asarray`.
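A minimal sketch of that pattern, using NumPy as both producer and consumer purely for illustration (in practice the input would come from another library that implements `__dlpack__`):

```python
import numpy as np

# from_dlpack consumes any object implementing the DLPack protocol,
# which allows a (typically zero-copy) hand-off between libraries.
x = np.asarray([1, 2, 3])
y = np.from_dlpack(x)

# asarray also accepts foreign array objects in most libraries.
z = np.asarray(x)

print(y.tolist())  # [1, 2, 3]
```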
> Non-standard dtypes may not have a pure Python equivalent, so that's clearly quite tricky. Things like `datetime` may work; for different precisions like `float128` it's hard to determine whether it's fine to downcast to 64-bit `float`s, etc. I don't think there should be a too-magical do-it-all function. The current issues with NumPy's `.tolist` show that that's a problem. For 0-D arrays, with the set of dtypes that you care about, it's easy enough to write something like:
>
> ```python
> def convert_0D_arrays(x):
>     if not x.ndim == 0:
>         raise ValueError('...')
>     if xp.isdtype(x.dtype, 'real floating'):
>         return float(x)
>     elif xp.isdtype(x.dtype, 'complex floating'):
>         return complex(x)
>     # etc.
> ```
The thing is that if the arrays implement additional dtypes apart from the standard ones (something allowed by the standard, as far as I know), this is not so easy to do from the user side. Consider for example the `datetime` extension you mentioned. There is no dunder equivalent like `__float__` for datetime, so `datetime(x)` will likely not work, and neither will `memoryview`. So, what can a user do to retrieve the object?
And note that although it is valid to say "this is not a problem for the standard, as it only requires numerical types with dunder methods", I still see value in standardizing at least the naming and interface of a function similar to NumPy's `item` that provides the "best" way to represent a scalar quantity as a Python object, as intended by the array developers. Depending on how it is standardized, it could even be the case that the value returned for, for example, a `float64` dtype is not a Python float. For example, a library that wraps standard-compatible arrays and adds physical units on top, presenting an API similar to the standard, could implement `item` as returning objects with units attached.
> I still see value in standardizing at least the naming and interface of a function similar to NumPy's `item` that provides the "best" way to represent a scalar quantity as a Python object
There's a reasonable amount of consensus, among both NumPy devs and devs from other array libraries, that NumPy's scalars were a design mistake. They add a large amount of complexity, and we'd remove them from NumPy if we could (but, backwards compat). So I don't think this is going to fly.
> Consider for example the `datetime` extension you mentioned. There is no dunder equivalent like `__float__` for datetime.
There is only one library that supports datetime dtypes, namely NumPy. So you can explicitly handle that case with a NumPy function.
Just thinking out loud... If we agree that a "0D list" is an ill-defined construct, perhaps we can at least have clean `.tolist()` semantics for unambiguous cases? From a purist perspective, in addition to always getting a return value of type `list`, it's also very good to try to preserve the dimension/shape of an array, to facilitate a correct round trip (`tolist` → `asarray`, or vice versa).
- `.tolist()` returns an N-nested Python list, with the lengths of the inner lists dictated by `.shape`:
  - `.shape = (2,)`, output = `[1, 2]`
  - `.shape = (2, 3)`, output = `[[1, 2, 3], [2, 3, 4]]`
- `.tolist()` must assume C order:
  - `.shape = (0,)`, output = `[]`
  - `.shape = (2, 0)`, output = `[[], []]`
  - `.shape = (2, 3, 0)`, output = `[[[], [], []], [[], [], []]]`
  - `.shape = (0, 2)`
  - `.shape = (2, 0, 4)`
  - `.shape = (0, 3, 4)`
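The zero-sized cases can be checked against NumPy's existing `.tolist()` (NumPy serving only as prior art here); note in particular that the round trip loses trailing shape information when a leading dimension is zero:

```python
import numpy as np

print(np.zeros((2, 0)).tolist())   # [[], []]
print(np.zeros((0, 2)).tolist())   # []

# The round trip through asarray cannot recover the trailing "2":
roundtrip = np.asarray(np.zeros((0, 2)).tolist())
print(roundtrip.shape)  # (0,)
```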
> (`int()`, `float()`, ... to get a Python scalar)

That seems reasonable @leofang. However, there are going to be other exceptions aside from 0-D arrays, because leaving array land isn't always possible. E.g., what about non-CPU devices, or detaching from an autograd graph?
In case this is implemented, I would rather have the natural behavior for 0D arrays: returning a non-list (either a Python representation of the scalar value itself, or the array unchanged), so that `array(0).tolist() == array([0]).tolist()[0]`. If the name `tolist` is considered problematic for something that does not necessarily return a list, I would change the name rather than raising an exception for a case where the natural behavior is obvious.
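NumPy's current behavior already satisfies this identity (shown here only as prior art, not as what the standard prescribes):

```python
import numpy as np

a = np.array(0)    # 0-D array
b = np.array([0])  # 1-D array

# A 0-D array's tolist() returns the scalar itself, not a list.
print(a.tolist())                   # 0
print(b.tolist())                   # [0]
print(a.tolist() == b.tolist()[0])  # True
```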
Adding to @betatim's

> This issue made me wonder about converting from one namespace to another. Say from PyTorch to NumPy. This works:
>
> ```python
> x = array_api_compat.torch.asarray([1, 2, 3])
> array_api_compat.numpy.asarray(x)  # -> array([1, 2, 3])
> ```
>
> The reason I was thinking about this was that it would be nice to have a consistent way of converting things. Of course, there is no `asarray` for normal Python, so this is more of a thought experiment.

and @oleksandr-pavlyk's

> We did discuss a possibility to standardize bringing data from any array object to Python. It would make sense to have a function that would transfer the content of an array into another type that exposes the Python buffer protocol. From there the content could be converted to NumPy, or passed to `xp.asarray` in another library.

remarks: should we use a separate issue that covers inter-namespace conversion specifically, rather than `tolist`?
I want to highlight a use case I have when trying to adapt code for Array API compliance. There is some code that can't compromise on numerical accuracy and absolutely requires at least float64 precision, but I could use an integrated GPU that supports at most float32 (e.g. using the `mps` or `xpu` backends with PyTorch) for everything else. For this I would have to transfer data from device to CPU, run the float64 compute, and transfer back to device. But `.to_device("cpu")` is not part of the standard, and some array libraries might not support it (like CuPy arrays), so I can't rely on it. `from_dlpack` does not support inter-device conversion, so it's not appropriate either.

For this use case, an intermediate object that enables inter-device and inter-namespace conversion surely would be practical.
`tolist` has been mentioned, but conversion to and from NumPy is also commonly supported:

- PyTorch has `Tensor.numpy` and `torch.from_numpy`
- CuPy has `cupy.asnumpy`, and `cupy.asarray` works with NumPy arrays
- JAX has `jax.numpy.array`, and `np.asarray` works with JAX arrays
- Dask's `from_array` supports NumPy inputs, and `np.asarray` works with Dask arrays
- TensorFlow has `Tensor.numpy` and `tf.convert_to_tensor()`
- MXNet has `NDArray.asnumpy`, and `array` supports NumPy inputs
- dpctl has `asnumpy`, and `dpctl.tensor.asarray` supports NumPy inputs

Wouldn't it be practical to add to the Array API a conversion to NumPy, e.g. `to_numpy` or `asnumpy`? (`from_numpy` doesn't seem as necessary, since `asarray` or `from_dlpack` commonly already work with NumPy inputs.)
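For illustration only, here is a hypothetical `as_numpy` helper sketching what such a standardized conversion could do today on top of existing protocols (the name and fallback order are my assumptions, not anything specified; NumPy input is used so the snippet is self-contained):

```python
import numpy as np

def as_numpy(x):
    # Hypothetical helper: try the DLPack protocol first, then fall
    # back to asarray (which covers the buffer protocol and sequences).
    try:
        return np.from_dlpack(x)
    except (TypeError, AttributeError, RuntimeError):
        return np.asarray(x)

print(as_numpy(np.arange(3)).tolist())  # [0, 1, 2]
```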
> But `.to_device("cpu")` is not part of the standard, and some array libraries might not support it (like CuPy arrays), so I can't rely on it.
Related discussion https://github.com/data-apis/array-api/issues/626
@fcharras thanks for your thoughts! I've copied your comment to gh-626, so we can keep that "to host" topic there, and keep this one for `.tolist`.
I think (correct me if I am mistaken) that currently the only way to convert an array object back to a Python representation is to call `float`, `int`, `bool`, etc. on 0D arrays. This requires that the user knows the appropriate function to call and does not offer any standard way to retrieve the underlying Python object when the library has additional dtypes, such as `object` in NumPy.

Moreover, as there is no `tolist` in the standard, it is also not possible to obtain a list representation of the array (from which the Python object could be retrieved).

I propose to add `tolist` to the standard, as defined in NumPy and PyTorch, to deal with these cases. Although the name is a bit misleading (because for 0D arrays there is no list at all), I think that prior art justifies reusing that name.
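A minimal sketch of the semantics being proposed, assuming NumPy-style `ndim`, `shape`, and `item()` (the recursive structure is my illustration, not a specification):

```python
import numpy as np

def tolist(x):
    # Recurse over the leading axis; a 0-D array yields the Python
    # scalar itself rather than a list, matching NumPy/PyTorch.
    if x.ndim == 0:
        return x.item()
    return [tolist(x[i, ...]) for i in range(x.shape[0])]

print(tolist(np.arange(6).reshape(2, 3)))  # [[0, 1, 2], [3, 4, 5]]
print(tolist(np.array(7)))                 # 7
```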