data-apis / array-api

RFC document, tooling and other content related to the array API standard
https://data-apis.github.io/array-api/latest/

RFC: add support for determining the size of arrays in bytes #789

Open · keewis opened 2 months ago

keewis commented 2 months ago

In trying to adapt xarray to numpy>=2 (and thus switching testing code from numpy.array_api to array-api-strict), I noticed that the array API does not require the nbytes property on arrays, nor the itemsize property on dtypes.

Thus, the only way we could find to figure out the size of an array was to write a function that dispatches to finfo / iinfo (and returns a hard-coded 1 byte for booleans), then combine that with arr.size to compute the total size. This feels like more work than should be necessary, so I wonder if you would be open to extending the array API with arr.nbytes or arr.dtype.itemsize (or both)?
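A minimal sketch of that workaround (the helper names are illustrative, and the hard-coded boolean size is an assumption the standard does not guarantee):

```python
def itemsize(dtype, xp):
    # Per-element size in bytes, derived from dtype metadata in the
    # namespace ``xp``. Booleans are hard-coded to 1 byte, which the
    # standard does not actually guarantee (see the discussion below).
    if xp.isdtype(dtype, "bool"):
        return 1
    if xp.isdtype(dtype, "integral"):
        return xp.iinfo(dtype).bits // 8
    if xp.isdtype(dtype, "complex floating"):
        # finfo reports bits per component for complex dtypes
        return 2 * xp.finfo(dtype).bits // 8
    return xp.finfo(dtype).bits // 8

def nbytes(arr, xp):
    # arr.size may be None for lazy arrays with unknown shapes
    return None if arr.size is None else arr.size * itemsize(arr.dtype, xp)
```

For NumPy-backed namespaces such as array-api-strict this should agree with arr.nbytes, but only because NumPy happens to use 1 byte per boolean element.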

rgommers commented 2 months ago

Hi @keewis, I think it's partly a case of "no one asked for this", but perhaps partly on purpose too. For the example you give: there is no requirement that an array library implement the bool dtype with 1 byte per element. It's conceivable that 1 bit per element is used (and in fact Arrow only has 1-bit bools). And IIRC .itemsize is inconsistent across libraries: it can be in bytes or in bits (not 100% sure of this).
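As a worked illustration of the gap (numbers chosen for illustration, not from the thread):

```python
import math

n = 1000                                # boolean elements
one_byte_per_element = n                # 1000 bytes (e.g. NumPy's bool_)
one_bit_per_element = math.ceil(n / 8)  # 125 bytes (e.g. Arrow's bit-packed bools)
```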

so I wonder if you would be open to extending the array API with arr.nbytes or arr.dtype.itemsize (or both)?

It seems reasonable - an array attribute probably more so than a dtype attribute, since dtypes are opaque objects (we know nothing about them beyond their names).

Can I ask what you are doing with the calculated size? Do you have internal logic for creating chunks based on array size or something like that?

keewis commented 2 months ago

Can I ask what you are doing with the calculated size?

Mainly for user information. Among other things, knowing the size of the variables in a newly opened dataset helps users decide whether to involve chunked arrays (like dask or cubed) at all, or whether the whole array fits into memory, in which case eager computation would be faster. For that reason, xarray has started to print the size of each array and the total size of all data variables in its reprs.
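A sketch of that kind of decision logic (the threshold and the helper name are illustrative assumptions, not xarray's actual code):

```python
import dask.array as da

MEMORY_THRESHOLD = 1_000_000_000  # bytes; illustrative cutoff

def maybe_chunk(arr):
    # Small arrays stay in memory for fast eager computation;
    # large ones are wrapped in a lazy, chunked dask array.
    if arr.nbytes < MEMORY_THRESHOLD:
        return arr
    return da.from_array(arr, chunks="auto")
```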

See pydata/xarray#8690 for some discussion of this (though maybe that is just evidence that nbytes is used a lot?).

kgryte commented 2 months ago

Another possibility would be to add a functional API for resolving the number of bytes, with some consideration for lazy arrays and arrays with non-deterministic shapes.
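A hypothetical sketch of such a function, reusing the itemsize helper from the first comment above (the name and the None-for-unknown-size convention are assumptions, not a proposal from the thread):

```python
def nbytes(x, xp):
    # Total size of ``x`` in bytes, or None when the number of elements
    # is unknown, e.g. a lazy array with a data-dependent shape.
    if x.size is None:
        return None
    return x.size * itemsize(x.dtype, xp)
```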