Open asfimport opened 2 years ago
Joris Van den Bossche / @jorisvandenbossche: To give a concrete copy-pastable example (using the one from the docs: https://arrow.apache.org/docs/dev/python/extending_types.html#custom-scalar-conversion):
from collections import namedtuple
import pyarrow as pa

Point3D = namedtuple("Point3D", ["x", "y", "z"])

class Point3DScalar(pa.ExtensionScalar):
    def as_py(self) -> Point3D:
        return Point3D(*self.value.as_py())

class Point3DType(pa.PyExtensionType):
    def __init__(self):
        pa.PyExtensionType.__init__(self, pa.list_(pa.float32(), 3))

    def __reduce__(self):
        return Point3DType, ()

    def __arrow_ext_scalar_class__(self):
        return Point3DScalar

storage = pa.array([[1, 2, 3], [4, 5, 6]], pa.list_(pa.float32(), 3))
arr = pa.ExtensionArray.from_storage(Point3DType(), storage)
>>> arr.to_pandas().values
array([array([1., 2., 3.], dtype=float32),
array([4., 5., 6.], dtype=float32)], dtype=object)
>>> arr.to_pylist()
[Point3D(x=1.0, y=2.0, z=3.0), Point3D(x=4.0, y=5.0, z=6.0)]
So here, to_pylist gives the nice scalars, while in to_pandas(), we have the raw numpy arrays from converting the storage list array.
We could do this automatically in to_pandas as well if we detect that the ExtensionType raises NotImplementedError for to_pandas_dtype and returns a subclass from \_\_arrow_ext_scalar_class\_\_.
On the other hand, you can also do this yourself by overriding to_pandas()?
And what about to_numpy()?
Rok Mihevc / @rok: As a user I would like to have an opt-in 'no thinking' route and an obvious way to override if needed.
Joris Van den Bossche / @jorisvandenbossche:
@rok but what is your preferred no-thinking route? Is that to use Scalar.as_py() if you define that (and then convert to object dtype Series in pandas?), or to use the storage array conversion?
Rok Mihevc / @rok:
I suppose as_py as the overridable "thinking" route and storage array conversion as "no-thinking" (although that is not explicitly opt-in).
Chang She / @changhiskhan: My head hurts trying to keep it all straight:
so we have:
Some of these are defined/performed in C++ and others in Python.
It is hard to think how to give devs clear guidance on the order of things.
This was raised in ARROW-17813 by @changhiskhan:
and I also mentioned this in ARROW-17535:
Reporter: Joris Van den Bossche / @jorisvandenbossche Watchers: Rok Mihevc / @rok
Note: This issue was originally created as ARROW-17925. Please see the migration documentation for further details.