data-apis / array-api

RFC document, tooling and other content related to the array API standard
https://data-apis.github.io/array-api/latest/
MIT License

Handling materialization of lazy arrays #748

Open hameerabbasi opened 4 months ago

hameerabbasi commented 4 months ago

Background

Some colleagues and I were doing some work on sparse when we stumbled onto a limitation of the current Array API Standard. @kgryte was kind enough to point out that it might have wider implications than just sparse, so it would be prudent to discuss it with other relevant parties in the community before settling on an API design, to avoid fragmentation.

Problem Statement

There are two notable things missing from the Array API standard today, which sparse, and potentially Dask, JAX and other relevant libraries might also need.

Potential solutions

Overload the Array.device attribute and the Array.to_device method.

One option is to overload the objects returned/accepted by these to contain a device + storage object. Something like the following:

class Storage:
    @property
    def device(self) -> Device:
        ...

    @property
    def format(self) -> Format:
        ...

    def __eq__(self, other: "Storage") -> bool:
        """ Compatible if combined? """

    def __ne__(self, other: "Storage") -> bool:
        """ Incompatible if combined? """

class Array:
    @property
    def device(self) -> Storage:
        ...

    def to_device(self, device: Storage, ...) -> "Array":
        ...

To materialize an array, one could use to_device(default_device()) (possible after #689 is merged).
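As a rough illustration of how that could look, here is a minimal toy sketch. Everything in it is an assumption for the sake of the example: the string-valued `device` and `format` fields, the `"lazy"` and `"dense"` format names, the `default_device()` helper returning a `Storage`, and the plain field-wise equality on `Storage` (the proposal above leaves the meaning of `__eq__` open). It is not how sparse or any other library implements this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Storage:
    """Hypothetical device + format pair (field-wise equality, an assumption)."""
    device: str
    format: str


def default_device() -> Storage:
    # Assumption: the default storage is an eager, dense, CPU-resident array.
    return Storage(device="cpu", format="dense")


class Array:
    """Toy array holding either concrete values or a deferred computation."""

    def __init__(self, data, storage: Storage):
        # For format == "lazy", `data` is a zero-argument callable.
        self._data = data
        self._storage = storage

    @property
    def device(self) -> Storage:
        return self._storage

    def to_device(self, device: Storage) -> "Array":
        if device == self._storage:
            return self
        if self._storage.format == "lazy" and device.format == "dense":
            # Materialization: run the deferred computation exactly once.
            return Array(self._data(), device)
        raise NotImplementedError(f"no conversion from {self._storage} to {device}")


# Usage: a lazy array is materialized by moving it to the default storage.
lazy = Array(lambda: [1, 2, 3], Storage(device="cpu", format="lazy"))
eager = lazy.to_device(default_device())
```

Note that the call site gives no hint that compute is being triggered, which is the mixing of concepts listed under the disadvantages.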

Advantages

As far as I can see, it's compatible with how the Array API standard works today.

Disadvantages

We're mixing the concepts of an execution context and storage format, and in particular overloading operators in a rather weird way.

Introduce an Array.format attribute and Array.to_format method.

Advantages

We can get the API right, maybe even introduce xp.can_mix_formats(...).

Disadvantages

Would need to wait till the 2024 revision of the standard at least.
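For comparison, a minimal sketch of what this second option might look like on a toy 1-D array. The format names (`"dense"`, `"coo"`), the dict-based COO layout, the constructor signature, and the rule used by `can_mix_formats` are all assumptions made up for this example; none of them is part of the standard or of any existing library.

```python
DENSE = "dense"
COO = "coo"


class Array:
    """Toy 1-D array stored as a list (dense) or an {index: value} dict (COO)."""

    def __init__(self, data, fmt: str, size: int):
        self._data = data
        self._format = fmt
        self._size = size

    @property
    def format(self) -> str:
        return self._format

    def to_format(self, fmt: str) -> "Array":
        if fmt == self._format:
            return self
        if fmt == DENSE:
            # COO -> dense: start from zeros and fill in the stored entries.
            dense = [0] * self._size
            for index, value in self._data.items():
                dense[index] = value
            return Array(dense, DENSE, self._size)
        if fmt == COO:
            # Dense -> COO: keep only the non-zero entries.
            coo = {i: v for i, v in enumerate(self._data) if v != 0}
            return Array(coo, COO, self._size)
        raise ValueError(f"unknown format: {fmt!r}")


def can_mix_formats(*formats: str) -> bool:
    # Assumption: only identical formats combine without explicit conversion.
    return len(set(formats)) <= 1


# Usage: convert explicitly before combining arrays of different formats.
x = Array([0, 5, 0, 7], DENSE, 4)
y = x.to_format(COO)
if not can_mix_formats(x.format, y.format):
    y = y.to_format(x.format)
```

Here the storage format is queried and changed independently of the device, so `Array.device` keeps its current meaning.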

Tagging potentially interested parties:

leofang commented 4 months ago

I think this topic will have to be addressed in v2024, as it's too big to be squeezed in v2023 which we're trying very hard to wrap up 😅

rgommers commented 4 months ago

A few quick comments:

hameerabbasi commented 4 months ago

I think this topic will have to be addressed in v2024, as it's too big to be squeezed in v2023 which we're trying very hard to wrap up 😅

No pressure. 😉

Materialization via some function/method in the API that triggers compute would be the one thing that is possibly actionable. However, that is quite tricky. The page I linked above has a few things to say about it.

Thanks Ralf -- that'd be a big help indeed. Materializing an entire array, as opposed to one element, is something that should be a common API across libraries, IMHO. I changed the title to reflect that.

kgryte commented 3 months ago

Cross linking https://github.com/data-apis/array-api/issues/728 as it may be relevant to this discussion.

adityagoel4512 commented 1 week ago

Materializing an entire array as opposed to one element is something that should be a common API across libraries, IMHO,

Just wanted to point out that it may be common but not universal. For instance, ndonnx arrays may not have any data that can be materialized. Such arrays do have data types and shapes and enable instant ONNX export of Array API compatible code. ONNX models are serializable computation graphs that you can load later, and so these "data-less" arrays denote model inputs that can be supplied at an entirely different point in time (in a completely different environment).

There are some inherently eager functions like __bool__ where we just raise an exception if there is no materializable data, in line with the standard. Any proposals around "lazy" arrays collecting values should have some kind of escape hatch like this.
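To make the escape hatch concrete, here is a minimal toy sketch. The class name, the constructor, and the choice of `ValueError` are assumptions for illustration only; this is not ndonnx's actual implementation or exception type.

```python
from typing import Optional, Tuple


class DataLessArray:
    """Toy array that always has a dtype and shape but may carry no value."""

    def __init__(self, dtype: str, shape: Tuple[int, ...], value: Optional[object] = None):
        self.dtype = dtype
        self.shape = shape
        # None marks a placeholder (e.g. a model input supplied later).
        self._value = value

    def __bool__(self) -> bool:
        if self._value is None:
            # Escape hatch: refuse instead of pretending data exists.
            raise ValueError("array has no materializable data")
        return bool(self._value)


# Usage: metadata is always available, eager conversion only when data exists.
placeholder = DataLessArray("bool", ())
concrete = DataLessArray("bool", (), value=True)
```

Any standardized materialization function would need the same kind of documented failure mode for the `placeholder` case.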