apache / beam

Apache Beam is a unified programming model for Batch and Streaming data processing.
https://beam.apache.org/
Apache License 2.0
7.9k stars 4.27k forks source link

[Bug]: `Tuple` is confused for a `NamedTuple` across Python versions (3.9 vs 3.10). #24685

Open alxmrs opened 1 year ago

alxmrs commented 1 year ago

What happened?

To me, this is a really interesting corner case! Here's what's going on:

I'm updating an operator in XArray-Beam https://github.com/google/xarray-beam/pull/69. I have a type that I've defined as follows:

Chunk = Union[Tuple[Key, xarray.Dataset], Tuple[Key, Tuple[xarray.Dataset, ...]]]

In versions of Python before 3.10, this causes no problems. However, in 3.10 (and presumably onward), this type leads to a error:

ValueError: expecting type typing.Tuple[xarray_beam._src.core.Key, xarray.core.dataset.Dataset] to have arity 0, had arity 2 instead

I've looked a bit in depth at the sources, and I think this is the root cause of the issue: https://github.com/apache/beam/blob/4886bdf08fd057a0f8a4e99ab30453089654188d/sdks/python/apache_beam/typehints/native_type_compatibility.py#L108

This function, which checks if the tuple has annotations to test if it's a named tuple, doesn't behave as I'd expect. In Python 3.10, it treats Tuple as a NamedTuple (which it is not). Further, when I change the type annotation to tuple, no errors occur. It seems like the recent addition of type-annotations with built-ins will require how match_is_named_tuple is implemented:

As a workaround, users can use the tuple tuple annotation via __futures__ for Python 3.7 and later. https://peps.python.org/pep-0585/#implementation

Issue Priority

Priority: 2 (default / most bugs should be filed as P2)

Issue Components

tvalentyn commented 1 year ago

Thanks for reporting. We should look into this a part of Python 3.10 upgrade. cc: @AnandInguva @jrmccluskey FYI.

tvalentyn commented 1 year ago

actually, this is for 3.10, not 3.11

cisaacstern commented 1 year ago

Thanks for opening this issue and looking so deeply into its cause, @alxmrs.

Just chiming in to say that this issue is currently blocking us upgrading to Python 3.10 in Pangeo Forge, xref https://github.com/pangeo-forge/pangeo-forge-recipes/pull/470.

@tvalentyn are you open to PRs to fix this? I may be able to devote a little time in the coming weeks, if you don't already have someone looking into this.

jrmccluskey commented 1 year ago

I'm looking into this right now along with #23366 (which points out that we currently do not handle the built-in tuple correctly when used as a parameterized generic) and hope to have a design doc detailing improvements around this mailed out to the dev list shortly

cisaacstern commented 1 year ago

Thanks, @jrmccluskey!

jrmccluskey commented 1 year ago

I'm actually having a bit of trouble reproducing the error described here and in https://github.com/pangeo-forge/pangeo-forge-recipes/pull/470. The only 0-arity types we handle are typing.NewType, typing.ForwardRef, typing.Any, and typing.NamedTuple but I'm not able get a typing.Tuple[K, V] to trip the logic and fall into one of these categories instead of typing.Tuple.

For any parameterized definition of typing.Tuple the boolean check that would route it to a NamedTuple is hasattr(user_type, '__annotations__')) == true.

hmc-cs-mdrissi commented 9 months ago

I suspect simplest fix would be to add one more hasattr here like hasattr(obj, "_fields"). Every NamedTuple has _fields as a requirement and I've been using that for a while without issue.

edit: Oh I'm being dumb as I was thinking of an instance not NamedTuple itself. NamedTuple is extra weird as it's not actually a class, but a function in python 3.10.

edit 2:

>>> from typing import NamedTuple
>>> class Foo(NamedTuple):
...   x: int
...
>>> Foo
<class '__main__.Foo'>
>>> dir(Foo)
['__add__', '__annotations__', '__class__', '__class_getitem__', '__contains__', '__delattr__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getnewargs__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__match_args__', '__module__', '__mul__', '__ne__', '__new__', '__orig_bases__', '__reduce__', '__reduce_ex__', '__repr__', '__rmul__', '__setattr__', '__sizeof__', '__slots__', '__str__', '__subclasshook__', '_asdict', '_field_defaults', '_fields', '_make', '_replace', 'count', 'index', 'x']
>>> hasattr(Foo, '_fields')
True

So I think hasattr for type does work well too.