data-apis / array-api

RFC document, tooling and other content related to the array API standard
https://data-apis.github.io/array-api/latest/
MIT License

Protocol for array objects #229

Open pmeier opened 3 years ago

pmeier commented 3 years ago

I've tried to tackle static typing and got a vendorable protocol that can be checked statically as well as at runtime for all but one case I'm going to detail below.

Protocol

```python
import enum
from typing import Any, Optional, Protocol, Tuple, TypeVar, Union, runtime_checkable

A = TypeVar("A")


@runtime_checkable
class VendoredArrayProtocol(Protocol[A]):
    @property
    def dtype(self) -> Any: ...
    @property
    def device(self) -> Any: ...
    @property
    def ndim(self) -> int: ...
    @property
    def shape(self) -> Any: ...
    @property
    def size(self) -> int: ...
    @property
    def T(self) -> A: ...
    def __abs__(self) -> A: ...
    def __add__(self, other: Union[int, float, A], /) -> A: ...
    def __and__(self, other: Union[bool, int, A], /) -> A: ...
    def __array_namespace__(self, /, *, api_version: Optional[str] = None) -> Any: ...
    def __bool__(self) -> bool: ...
    def __dlpack__(self, /, *, stream: Optional[Union[int, Any]] = None) -> Any: ...
    def __dlpack_device__(self) -> Tuple[enum.IntEnum, int]: ...
    # This overrides the input type, since object.__eq__ handles any input
    # This overrides the return type, since object.__eq__ returns a bool
    def __eq__(  # type: ignore[override]
        self,
        other: Union[bool, int, float, A],
        /,
    ) -> A:  # type: ignore[override]
        ...
    def __float__(self) -> float: ...
    def __floordiv__(self, other: Union[int, float, A], /) -> A: ...
    def __ge__(self, other: Union[int, float, A], /) -> A: ...
    def __getitem__(
        self,
        key: Union[int, slice, Tuple[Union[int, slice], ...], A],
        /,
    ) -> A: ...
    def __gt__(self, other: Union[int, float, A], /) -> A: ...
    def __int__(self) -> int: ...
    def __invert__(self) -> A: ...
    def __le__(self, other: Union[int, float, A], /) -> A: ...
    def __len__(self) -> int: ...
    def __lshift__(self, other: Union[int, A], /) -> A: ...
    def __lt__(self, other: Union[int, float, A], /) -> A: ...
    def __matmul__(self, other: A) -> A: ...
    def __mod__(self, other: Union[int, float, A], /) -> A: ...
    def __mul__(self, other: Union[int, float, A], /) -> A: ...
    # This overrides the input type, since object.__ne__ handles any input
    # This overrides the return type, since object.__ne__ returns a bool
    def __ne__(  # type: ignore[override]
        self,
        other: Union[bool, int, float, A],
        /,
    ) -> A:  # type: ignore[override]
        ...
    def __neg__(self) -> A: ...
    def __or__(self, other: Union[bool, int, A], /) -> A: ...
    def __pos__(self) -> A: ...
    def __pow__(self, other: Union[int, float, A], /) -> A: ...
    def __rshift__(self, other: Union[int, A], /) -> A: ...
    def __setitem__(
        self,
        key: Union[int, slice, Tuple[Union[int, slice], ...], A],
        value: Union[bool, int, float, A],
        /,
    ) -> None: ...
    def __sub__(self, other: Union[int, float, A], /) -> A: ...
    def __truediv__(self, other: Union[int, float, A], /) -> A: ...
    def __xor__(self, other: Union[bool, int, A], /) -> A: ...
```

To test everything yourself you can use this playground repo.

Current blocker

It is currently impossible to use Ellipsis in type annotations, since its alias ... has a different meaning there. As a result, the __getitem__ and __setitem__ methods cannot be annotated correctly. There is a fix for this in python/cpython#22336, but it will only be shipped with Python 3.10. If we leave it out of the annotation, accessing the array with something like Array()[..., 0] will be flagged by mypy although it should be supported according to the specification.

Suggested improvements

While working on the protocol I found a few issues that could be addressed:

BvB93 commented 3 years ago

In general this seems like a perfect case for the use of a Protocol, so +1!

Based on my experience in the typing of numpy I do have a couple of comments:

It is currently impossible to use Ellipsis in type annotations, since its alias ... has a different meaning there.

Currently there exists the type-check-only builtins.ellipsis class that is used for this exact purpose. Its public status is somewhat questionable (https://github.com/python/typeshed/issues/3556), but I'd argue that some pragmatism is required here, as it is currently the only way of typing the ... singleton.
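To illustrate the "type-check-only" status: the class exists at runtime as type(...), but the name ellipsis is not bound in builtins, which is why the import has to be guarded behind TYPE_CHECKING. A minimal sketch:

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Only resolvable by type checkers; this import never runs at runtime.
    from builtins import ellipsis

# The class itself is reachable at runtime, just not by name:
print(type(...))  # <class 'ellipsis'>

try:
    eval("ellipsis")  # the name is not bound outside of type checking
except NameError:
    print("ellipsis is not bound at runtime")
```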

def __eq__(  # type: ignore[override]
    self,
    other: Union[bool, int, float, A],
    /,
) -> A:  # type: ignore[override]
    ...

While # type: ignore fixes the mypy error, it does not get rid of a more fundamental problem: nearly every single object supports comparison due to object.__eq__. https://github.com/numpy/numpy/issues/17368#issuecomment-710110381 contains a more complete example, but the general gist is that we can only describe what happens with VendoredArrayProtocol() == FooType(), but have zero control over FooType() == VendoredArrayProtocol(), as that is the domain of FooType.__eq__ and/or object.__eq__.
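A minimal runtime illustration of that asymmetry, with Array and FooType as hypothetical stand-ins:

```python
class Array:
    # Stand-in for an array type whose __eq__ returns an element-wise result.
    def __eq__(self, other):
        return "elementwise result"  # a real library would return an Array


class FooType:
    # FooType decides its own comparison semantics, ignoring Array entirely.
    def __eq__(self, other):
        return False


print(Array() == FooType())  # elementwise result -- Array.__eq__ runs
print(FooType() == Array())  # False -- FooType.__eq__ wins; Array has no say
```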

The binary dunder methods take specific input types for the other parameter. For example, __add__ takes Union[int, float, Array]. IMO they should take Any and return NotImplemented in case they cannot work with the type. For example:
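A sketch of that suggestion, using a hypothetical Meters type: the parameter is annotated Any, and unsupported operands are deferred via NotImplemented so that Python can try the reflected operation on the other operand:

```python
from typing import Any


class Meters:
    def __init__(self, value: float) -> None:
        self.value = value

    def __add__(self, other: Any) -> "Meters":
        # Accept Any statically, but defer at runtime for unsupported types.
        if isinstance(other, Meters):
            return Meters(self.value + other.value)
        if isinstance(other, (int, float)):
            return Meters(self.value + other)
        return NotImplemented


print((Meters(1.0) + 2).value)   # 3.0
print(Meters(1.0).__add__("x"))  # NotImplemented
```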

As a reminder: for A() + B() to work, the __add__ methods of the two types do not have to be symmetric. So even if A.__add__ does not support B as an argument, everything will work out fine as long as B.__add__ does support A. A practical example would be list.__mul__ and int.__mul__, the former supporting the latter but not the other way around.
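The list/int case can be checked directly: int.__mul__ does not accept a list, yet 3 * [1, 2] still works because list also defines __rmul__:

```python
print([1, 2] * 3)              # [1, 2, 1, 2, 1, 2] via list.__mul__
print(3 * [1, 2])              # [1, 2, 1, 2, 1, 2] via list.__rmul__
print(int.__mul__(3, [1, 2]))  # NotImplemented -- int alone cannot do it
```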

Array.shape should return Tuple[int, ...], but the discussion in array-api-tests#15 ("explicitly check for Python scalars in constant tests") implies that custom objects might also be possible. Maybe we can use Sequence[int]?

On the subject of shapes I'd like to point everyone's attention to PEP 646: Variadic Generics. Namely, it'd allow us to define a single shape type that takes a variable number of parameters (i.e. axis lengths), just like tuple can be Tuple[int], Tuple[int, int], Tuple[int, int, int], etc. With this in mind I'd suggest to either stick to Tuple, or wait until PEP 646 is live and then define a custom variadic Sequence type or protocol.

Array.dtype, Array.device, Array.__array_namespace__(), and Array.__dlpack__() should return custom objects, but the specification does not say what these objects look like. In the current state of the protocol I've typed them as Any, but the specification should be more precise.

For __dlpack__ specifically it should be quite easy to create a proper return type if we can get builtins.Pycapsule annotated in typeshed (e.g. https://github.com/numpy/numpy/pull/19083#discussion_r638116200). If not then it is always possible to create a custom type, but then we would have the issue that it'd have to be distributed from a single central place.

pmeier commented 3 years ago

Currently there exists the type-check-only builtins.ellipsis class that is used for this exact purpose.

Awesome, I didn't know about or find that. This unblocks the protocol, so we could add it to the specification. Current iteration:

Protocol

```python
import enum
from typing import (
    TYPE_CHECKING,
    Any,
    Optional,
    Protocol,
    Tuple,
    TypeVar,
    Union,
    runtime_checkable,
)

if TYPE_CHECKING:
    from builtins import ellipsis

A = TypeVar("A")


@runtime_checkable
class ArrayProtocol(Protocol[A]):
    @property
    def dtype(self) -> Any: ...
    @property
    def device(self) -> Any: ...
    @property
    def ndim(self) -> int: ...
    @property
    def shape(self) -> Tuple[int, ...]: ...
    @property
    def size(self) -> int: ...
    @property
    def T(self) -> A: ...
    def __abs__(self) -> A: ...
    def __add__(self, other: Union[int, float, A], /) -> A: ...
    def __and__(self, other: Union[bool, int, A], /) -> A: ...
    def __array_namespace__(self, /, *, api_version: Optional[str] = None) -> Any: ...
    def __bool__(self) -> bool: ...
    def __dlpack__(self, /, *, stream: Optional[Union[int, Any]] = None) -> Any: ...
    def __dlpack_device__(self) -> Tuple[enum.IntEnum, int]: ...
    # This overrides the input type, since object.__eq__ handles any input
    # This overrides the return type, since object.__eq__ returns a bool
    def __eq__(  # type: ignore[override]
        self,
        other: Union[bool, int, float, A],
        /,
    ) -> A:  # type: ignore[override]
        ...
    def __float__(self) -> float: ...
    def __floordiv__(self, other: Union[int, float, A], /) -> A: ...
    def __ge__(self, other: Union[int, float, A], /) -> A: ...
    def __getitem__(
        self,
        key: Union[
            int,
            slice,
            "ellipsis",
            Tuple[Union[int, slice, "ellipsis"], ...],
            A,
        ],
        /,
    ) -> A: ...
    def __gt__(self, other: Union[int, float, A], /) -> A: ...
    def __int__(self) -> int: ...
    def __invert__(self) -> A: ...
    def __le__(self, other: Union[int, float, A], /) -> A: ...
    def __len__(self) -> int: ...
    def __lshift__(self, other: Union[int, A], /) -> A: ...
    def __lt__(self, other: Union[int, float, A], /) -> A: ...
    def __matmul__(self, other: A) -> A: ...
    def __mod__(self, other: Union[int, float, A], /) -> A: ...
    def __mul__(self, other: Union[int, float, A], /) -> A: ...
    # This overrides the input type, since object.__ne__ handles any input
    # This overrides the return type, since object.__ne__ returns a bool
    def __ne__(  # type: ignore[override]
        self,
        other: Union[bool, int, float, A],
        /,
    ) -> A:  # type: ignore[override]
        ...
    def __neg__(self) -> A: ...
    def __or__(self, other: Union[bool, int, A], /) -> A: ...
    def __pos__(self) -> A: ...
    def __pow__(self, other: Union[int, float, A], /) -> A: ...
    def __rshift__(self, other: Union[int, A], /) -> A: ...
    def __setitem__(
        self,
        key: Union[
            int,
            slice,
            "ellipsis",
            Tuple[Union[int, slice, "ellipsis"], ...],
            A,
        ],
        value: Union[bool, int, float, A],
        /,
    ) -> None: ...
    def __sub__(self, other: Union[int, float, A], /) -> A: ...
    def __truediv__(self, other: Union[int, float, A], /) -> A: ...
    def __xor__(self, other: Union[bool, int, A], /) -> A: ...
```

As a reminder: for A() + B() to work, the __add__ methods of the two types do not have to be symmetric. So even if A.__add__ does not support B as an argument, everything will work out fine as long as B.__add__ does support A.

This is only true if B implements __radd__:

```python
class B:
    def __add__(self, other):
        return "B"


class C(B):
    def __radd__(self, other):
        return self.__add__(other)


class A:
    def __add__(self, other):
        if isinstance(other, B):
            return NotImplemented

        return "A"


a = A()
b = B()
c = C()

print(a + c)  # prints: B
print(a + b)  # raises: TypeError: unsupported operand type(s) for +: 'A' and 'B'
```
BvB93 commented 3 years ago

This is only true if B implements __radd__:

Fair enough, I implicitly assumed that both __add__ and __radd__ would be defined. Nevertheless, my point here is that there is no need for Any as long as a custom user type has properly annotated its __add__ and __radd__ methods (which is the user's responsibility).

```python
from __future__ import annotations


class A:
    def __add__(self, other: A) -> A: ...
    def __radd__(self, other: A) -> A: ...


class B:
    def __add__(self, other: A | B) -> B: ...
    def __radd__(self, other: A | B) -> B: ...


a: A
b: B

reveal_type(a + a)  # Revealed type is '__main__.A'
reveal_type(a + b)  # Revealed type is '__main__.B'
reveal_type(b + a)  # Revealed type is '__main__.B'
reveal_type(b + b)  # Revealed type is '__main__.B'
```
asmeurer commented 3 years ago

The NotImplemented question came up in the NumPy PR as well (https://github.com/numpy/numpy/pull/18585#issuecomment-877503058).

My opinion is that the spec should not disallow returning NotImplemented but not require it. Libraries like NumPy have their own dispatching mechanism that isn't based on NotImplemented. I also think it should be out of scope for the spec to define how __eq__ works on unexpected types. An array library may want to work like NumPy where it absorbs things in __eq__, especially if it implements more dtypes than are in the spec, or it may want to return True/False (or NotImplemented), or just error. Any of those options are reasonable. I'm not sure how this affects typing, but it's also my opinion that the type hints should follow from the constraints of the spec, not the other way around.

These are just my opinions though, so maybe others will disagree.

rgommers commented 3 years ago

@pmeier can you summarize the issue if every array library vendors this protocol, instead of putting it in a common package and inheriting from it? That still has a limitation, right?

My opinion is that the spec should not allow returning NotImplemented but not require it.

"not allow" --> "allow" ?

For __dlpack__ specifically it should be quite easy to create a proper return type if we can get builtins.Pycapsule annotated in typeshed (e.g. numpy/numpy#19083 (comment)). If not then it is always possible to create a custom type, but then we would have the issue that it'd have to be distributed from a single central place.

+1 to wait for typeshed here (or go help them along if it starts taking too long).

With this in mind I'd suggest to either stick to Tuple, or wait until PEP 646 is live and then define a custom variadic Sequence type or protocol.

Is there a problem with using Tuple[int, ...]? I'd think that that is a little more specific than plain Tuple, without it being a problem to change it to something even more specific later?

asmeurer commented 3 years ago

"not allow" --> "allow" ?

Yes that is what I meant.

pmeier commented 3 years ago

@rgommers

can you summarize the issue if every array library vendors this protocol, instead of putting it in a common package and inheriting from it? That still has a limitation, right?

Not sure what you mean here. If this is about my earlier concern (that I only voiced offline) that typing.TypeVar only works for directly related classes, this is not blocking anymore. Although the documentation states

Alternatively, a type variable may specify an upper bound using bound=<type>. This means that an actual type substituted (explicitly or implicitly) for the type variable must be a subclass of the boundary type [emphasis mine]

it seems to work out fine with the vendored protocol, which by definition cannot be related to an actual array implementation. If you have a look at the playground repo and specifically this file, mypy is happy although the ArrayImplementation class is independent of VendoredArrayProtocol.
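A minimal sketch of why no inheritance is needed: Protocols are structural, so an entirely unrelated class passes both the static check and, with @runtime_checkable, an isinstance() check. HasNdim and MyArray are hypothetical names:

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class HasNdim(Protocol):
    @property
    def ndim(self) -> int: ...


class MyArray:  # no relation to HasNdim whatsoever
    @property
    def ndim(self) -> int:
        return 2


# Note: runtime_checkable isinstance() only checks member presence, not
# signatures or return types; full conformance is verified statically.
print(isinstance(MyArray(), HasNdim))  # True -- structural match by name
```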

Is there a problem with using Tuple[int, ...]?

Yes, see https://github.com/data-apis/array-api-tests/pull/15#issuecomment-858591464. But this is not solved by variadic generics either IIUC.

pmeier commented 3 years ago

I've tested the protocol against numpy, and apart from the device attribute and the __array_namespace__, __dlpack__, and __dlpack_device__ dunders, numpy.ndarray checks out fine.

BvB93 commented 3 years ago

Is there a problem with using Tuple[int, ...]?

Right, to clarify: the reason for excluding subscription was for the sake of creating a compact example.

can you summarize the issue if every array library vendors this protocol, instead of putting it in a common package and inheriting from it? That still has a limitation, right?

Assuming we won't run into any bugs with type checkers, the sole "issue" here would be code duplication.

rgommers commented 3 years ago

Assuming we won't run into any bugs with type checkers, the sole "issue" here would be code duplication.

Good. I'm not bothered by code duplication, I'm pretty sure that all array libraries will prefer that over adding a runtime dependency. However, we still need a way for downstream consumers to access this protocol - downstream packages may be less happy to vendor, and end users even less. That could be solved by:

  1. A standalone package
  2. Each array library exposing its vendored copy

We will need a canonical version somewhere anyway to ensure vendored copies can be updated and remain in sync. So it may be preferable to go with (1).

rgommers commented 3 years ago

We had a discussion about this today. A standalone package may indeed be nice, and then we also have a place where other utilities or tests can be added (this has come up several times before, and we always tried to avoid a package because that's work to maintain - but here it seems like we can't avoid that anyway).

In terms of vendoring, it may be possible to avoid that by (in an array library) doing:

```python
try:
    from array_pkgname import ArrayProtocol
except ImportError:
    # This is an optional dependency. Static type checking against the Protocol
    # won't work for downstream users if the package is not installed.
    class ArrayProtocol:
        pass


class ndarray(ArrayProtocol):
    # array object in this library (could also be named Array, Tensor, etc.)
    ...
```

A number of libraries don't support static typing at all yet (TensorFlow, CuPy), while some others (NumPy, PyTorch, MXNet) do. In that case, the only libraries that may want to vendor the protocol are the ones that want to use it in their own test suite.

honno commented 2 years ago

Related discussion (somehow totally missed this interesting issue agh)