apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[Python] Python extension types aren't usable in struct arrays #34985

Open spenczar opened 1 year ago

spenczar commented 1 year ago

Describe the bug, including details regarding any error messages, version, and platform.

I have a custom PyExtensionType. I would like to use that type for a field of a struct array, and use that array inside a larger data structure.

I'm able to use the extension type directly in pa.array, but if I wrap it in a pa.struct, pa.array fails.

Here is a minimal reproducer. Running make_array_ok() does not error. Running make_struct_array_not_ok() results in an error.

import pyarrow as pa

class CustomExtensionType(pa.PyExtensionType):
    def __init__(self):
        pa.PyExtensionType.__init__(self, pa.int64())

    def __reduce__(self):
        return CustomExtensionType, ()

def make_array_ok():
    data = [1, 2, 3, 4, 5]
    typ = CustomExtensionType()
    return pa.array(data, typ)

def make_struct_array_not_ok():
    data = [{"val": 1}, {"val": 2}, {"val": 3}, {"val": 4}, {"val": 5}]
    typ = CustomExtensionType()
    struct_type = pa.struct([("val", typ)])
    return pa.array(data, type=struct_type)

The error is not very clear to me:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/redacted/struct_demo.py", line 22, in make_struct_array_not_ok
    return pa.array(data, type=struct_typ)
  File "pyarrow/array.pxi", line 320, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 39, in pyarrow.lib._sequence_to_array
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 121, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: extension

Version info

Component(s)

Python

spenczar commented 1 year ago

Actually, making the array explicitly with StructArray.from_arrays works:

def make_struct_array_explicitly():
    data = [1, 2, 3, 4, 5]
    typ = CustomExtensionType()
    return pa.StructArray.from_arrays([pa.array(data, typ)], fields=[("val", typ)])

So I think this is really about pa.array's handling of structs that contain extension types.

NellyWhads commented 3 days ago

Is there anyone who can shed some light on this issue? It's turned into somewhat of a blocker for me. I need to construct a table with nested extension arrays inside structs using pyarrow.