data-apis / array-api

RFC document, tooling and other content related to the array API standard
https://data-apis.github.io/array-api/latest/
MIT License

RFC: `item()` to return scalar for arrays with exactly 1 element. #815

Open randolf-scholz opened 1 week ago

randolf-scholz commented 1 week ago
def item(self) -> Scalar:
    """If the array contains exactly one element, return it as a scalar; otherwise raise ValueError."""

Examples:

Demo:

import pytest
import torch
import xarray as xr
import pandas as pd
import polars as pl
import numpy as np

@pytest.mark.parametrize("data", [[], [1, 2, 3]])
@pytest.mark.parametrize(
    "array_type", [torch.tensor, np.array, pd.Series, pd.Index, pl.Series, xr.DataArray]
)
def test_item_valueerror(data, array_type):
    array = array_type(data)
    with pytest.raises(ValueError):
        array.item()

@pytest.mark.parametrize(
    "array_type", [torch.tensor, np.array, pd.Series, pd.Index, pl.Series, xr.DataArray]
)
def test_item(array_type):
    array = array_type([1])
    array.item()

Currently, only torch fails, because it raises RuntimeError instead of ValueError.
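A minimal sketch of how user code could work around the exception mismatch, assuming the exception type is the only difference. The helper name `item_or_valueerror` is hypothetical and not part of any library:

```python
def item_or_valueerror(array):
    """Call array.item(), re-raising RuntimeError as ValueError.

    Hypothetical helper: it only normalizes the exception type and
    relies on the array object already providing an item() method.
    """
    try:
        return array.item()
    except RuntimeError as exc:
        # Assumption: a RuntimeError here means "not exactly one element".
        raise ValueError(str(exc)) from exc
```

Catching RuntimeError this broadly could also mask unrelated runtime failures, so this is a stopgap rather than a replacement for consistent behaviour across libraries.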

vnmabus commented 1 week ago

This was discussed in #710, along with the more general to_list, which also works for ND arrays.

randolf-scholz commented 1 week ago

item() is a bit different from to_list, and honestly I find it confusing that a method named to_list can return something that is not a list.

rgommers commented 1 week ago

.item() is more constrained than to_list indeed, and a bit cleaner. I checked other libraries - NumPy, PyTorch, JAX and CuPy implement .item(), Dask does not. (TF doesn't have it in the docs, so probably also not - but I can't check). CuPy/JAX do the transfer to CPU if the ndarray is on GPU.

This is a minor convenience method though, since float() & co work as well. They are clearer, since they are type-stable, and they also work for Dask. The only downside is that if you want a dtype-generic implementation that returns a single element, you have to write a little utility that calls int/float/complex/bool as appropriate. Something like:

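def as_pyscalar(x):
    # Look up the array API namespace of x (xp was assumed to be in scope)
    xp = x.__array_namespace__()
    # isdtype() takes a dtype, not an array
    if xp.isdtype(x.dtype, 'real floating'):
        return float(x)
    elif xp.isdtype(x.dtype, 'complex floating'):
        return complex(x)
    elif xp.isdtype(x.dtype, 'integral'):
        return int(x)
    elif xp.isdtype(x.dtype, 'bool'):
        return bool(x)
    else:
        # raise error, or handle custom/non-standard dtypes if desired
        raise TypeError(f"unsupported dtype: {x.dtype}")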

Static typing of such a function, and of .item(), would also be a little annoying as it requires overloads.

asmeurer commented 1 week ago

item also works on arrays with multiple dimensions, whereas we decided to make it so float does not.

>>> np.array([1]).item()
1

rgommers commented 5 days ago

We discussed this in a call today, and concluded that this falls into a bucket of functionality that is useful, but also easy to implement on top of what's already in the standard. In addition, there are problems with trying to add this: an item() method is hard, because it's missing in some libraries and missing methods cannot be worked around in array-api-compat. If we were to do this, a function would be the way to go - but since that's not present in any library, it'd be new - hence more work, and likely to meet resistance from array library maintainers.

Outcome:

  1. Create the array-api-extra package where this kind of function can live, and add it there (probably as as_pyscalar or a similarly descriptive name, not as item)
  2. Only reconsider adding it to the standard itself in the future if most/all array libraries have already added that function.
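
As a rough illustration of "easy to implement on top of what's already in the standard", here is a sketch of a function-style equivalent. The name `single_item` is hypothetical (it is not the array-api-extra API). It assumes the input follows the array API standard, i.e. it provides `__array_namespace__()`, `size`, and `dtype`, and that its namespace provides `reshape` and `isdtype`:

```python
def single_item(x):
    """Return the only element of x as a Python scalar.

    Hypothetical helper: raises ValueError unless x has exactly one
    element, regardless of how many dimensions x has.
    """
    xp = x.__array_namespace__()
    if x.size != 1:
        raise ValueError(f"expected exactly 1 element, got size {x.size}")
    # Reshape to 0-D first, since the builtin conversions are only
    # specified for zero-dimensional arrays in the standard.
    x0 = xp.reshape(x, ())
    if xp.isdtype(x0.dtype, "bool"):
        return bool(x0)
    if xp.isdtype(x0.dtype, "integral"):
        return int(x0)
    if xp.isdtype(x0.dtype, "real floating"):
        return float(x0)
    if xp.isdtype(x0.dtype, "complex floating"):
        return complex(x0)
    raise TypeError(f"unsupported dtype: {x0.dtype}")
```

Because it is a plain function rather than a method, it does not depend on every array library defining `.item()`.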

randolf-scholz commented 4 days ago

On a very fundamental level, I believe .item() makes no sense on DataFrame-like objects (pandas.DataFrame, polars.DataFrame, pyarrow.Table, etc.) because these are designed to represent heterogeneous data types.

From a mathematical point of view, item() acts on array-like data with a homogeneous type, as a representation of the natural isomorphism V → K, where V is a 1-dimensional vector space over K.