apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[Python] Use ExtensionScalar.as_py() as fallback in ExtensionArray to_pandas? #33134

Open asfimport opened 2 years ago

asfimport commented 2 years ago

This was raised in ARROW-17813 by @changhiskhan:

ExtensionArray => pandas

Just for discussion, I was curious whether you had any thoughts around using the extension scalar as a fallback mechanism. It's a lot simpler to define an ExtensionScalar with as_py than a pandas extension dtype. So if an ExtensionArray doesn't have an equivalent pandas dtype, would it make sense to convert it to just an object series whose elements are the result of as_py?

and I also mentioned this in ARROW-17535:

That actually brings up a question: if an ExtensionType defines an ExtensionScalar (but not an associated pandas dtype, or custom to_numpy conversion), should we use this scalar's as_py() for the to_numpy/to_pandas conversion as well for plain extension arrays? (not the nested case)

Because currently, if you have an ExtensionArray like that (for example using the example from the docs: https://arrow.apache.org/docs/dev/python/extending_types.html#custom-scalar-conversion), we still use the storage type conversion for to_numpy/to_pandas, and only use the scalar's conversion in to_pylist.

Reporter: Joris Van den Bossche / @jorisvandenbossche
Watchers: Rok Mihevc / @rok

Note: This issue was originally created as ARROW-17925. Please see the migration documentation for further details.

asfimport commented 2 years ago

Joris Van den Bossche / @jorisvandenbossche: To give a concrete copy-pastable example (using the one from the docs: https://arrow.apache.org/docs/dev/python/extending_types.html#custom-scalar-conversion):


from collections import namedtuple
import pyarrow as pa

Point3D = namedtuple("Point3D", ["x", "y", "z"])

class Point3DScalar(pa.ExtensionScalar):
    def as_py(self) -> Point3D:
        return Point3D(*self.value.as_py())

class Point3DType(pa.PyExtensionType):
    def __init__(self):
        pa.PyExtensionType.__init__(self, pa.list_(pa.float32(), 3))

    def __reduce__(self):
        return Point3DType, ()

    def __arrow_ext_scalar_class__(self):
        return Point3DScalar

storage = pa.array([[1, 2, 3], [4, 5, 6]], pa.list_(pa.float32(), 3))
arr = pa.ExtensionArray.from_storage(Point3DType(), storage)

>>> arr.to_pandas().values
array([array([1., 2., 3.], dtype=float32),
       array([4., 5., 6.], dtype=float32)], dtype=object)

>>> arr.to_pylist()
[Point3D(x=1.0, y=2.0, z=3.0), Point3D(x=4.0, y=5.0, z=6.0)]

So here, to_pylist gives the nice scalars, while in to_pandas(), we have the raw numpy arrays from converting the storage list array.

We could do this automatically in to_pandas as well if we detect that the ExtensionType raises NotImplementedError for to_pandas_dtype and returns a subclass from __arrow_ext_scalar_class__.

On the other hand, you can also do this yourself by overriding to_pandas()?

And what about to_numpy()?

asfimport commented 2 years ago

Rok Mihevc / @rok: As a user I would like to have an opt-in 'no thinking' route and an obvious way to override if needed.

asfimport commented 2 years ago

Joris Van den Bossche / @jorisvandenbossche: @rok but what is your preferred no-thinking route? Is that to use Scalar.as_py() if you define that (and then convert to object dtype Series in pandas?), or to use the storage array conversion?

asfimport commented 2 years ago

Rok Mihevc / @rok: I suppose as_py would be the overridable "thinking" route, and the storage array conversion the "no-thinking" one (although that is not explicitly opt-in).

asfimport commented 2 years ago

Chang She / @changhiskhan: My head hurts trying to keep it all straight:

so we have: