apache / arrow-nanoarrow

Helpers for Arrow C Data & Arrow C Stream interfaces
https://arrow.apache.org/nanoarrow
Apache License 2.0

feat(python): Implement extension type and Schema metadata support #431

Closed paleolimbot closed 5 months ago

paleolimbot commented 5 months ago

The initial motivation of this PR was to ensure that extension types are handled in nanoarrow's Schemas; however, this exposed that metadata was not handled by the Schema in either create or consume mode.

After this PR, extension types can be created and inspected. This is really just creating a schema with some metadata (and looking for specific metadata when consuming).

import nanoarrow as na

schema = na.extension_type(na.int32(), "arrow.example", b'{"some_param": 1234}')
if schema.type == na.Type.EXTENSION:
    print(schema.extension.name, schema.extension.metadata)
#> arrow.example b'{"some_param": 1234}'

In doing some testing, there were a number of places where extension schemas/arrays were implicitly treated as their storage types. I've tried to error or warn for these cases as much as possible:

import nanoarrow as na

ext = na.extension_type(na.int32(), "arrow.example")
na.Array([1, 2, 3], ext)
#> TypeError: ...
#> Can't create buffer from extension type arrow.example

It's a little hard to create an extension array at the moment (and there should probably be a similar option to strip the extension type from an Array to get just the storage), but I think that is maybe a job for another PR that is more about arrays and less about schemas:

import nanoarrow as na

ext = na.extension_type(na.int32(), "arrow.example")
storage = na.c_array([1, 2, 3], ext.extension.storage)
_, storage_capsule = storage.__arrow_c_array__()
extension = na.Array(storage_capsule, ext)
list(extension.iter_py())
#> UnregisteredExtensionWarning: <unnamed int32>: Converting unregistered extension 'arrow.example' as storage type
#> [1, 2, 3]

A side effect of all of this is that there is better support for modifying schemas:

import nanoarrow as na

na.Schema(na.int32(), name="some_col")
#> Schema(INT32, name='some_col')
schema = na.Schema(na.int32(), metadata={"some_key": "some_value"})
schema.metadata[b"some_key"]
#> b'some_value'

na.c_schema(na.int32()).modify(
    name="some_col",
    metadata={"some_key": "some_value"},
    nullable=False
)
#> <nanoarrow.c_lib.CSchema int32>
#> - format: 'i'
#> - name: 'some_col'
#> - flags: 0
#> - metadata:
#>   - b'some_key': b'some_value'
#> - dictionary: NULL
#> - children[0]:
jorisvandenbossche commented 5 months ago

Do we want to go the route of a registry and having users define their own?

For nanoarrow, I would personally stick to what it is in essence: metadata (and we have the advantage that here the "data type" (ArrowSchema) of an array actually has this metadata field, so we don't have the problem we have in pyarrow, where the Array object loses that information).

jorisvandenbossche commented 5 months ago

(we could of course still provide some more ergonomic access to the extension name/metadata, e.g. by detecting whether those keys are present and, in that case, showing them more prominently in certain reprs, providing an easier getter property for them, etc.)

paleolimbot commented 5 months ago

Do we want to go the route of a registry and having users define their own?

Great point! I'm still feeling my way through how users should interact with this.

The part where the concept of a registry is difficult to avoid (global or tightly scoped) is when doing conversion to/from Python objects with nested type support. Without a registry, how would one produce/consume a struct<geoarrow.polygon, uuid> without inefficiently bouncing through deeply nested Python output, or without every consuming package explicitly opting in to every possible extension type?
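To make the registry idea concrete, here is a minimal pure-Python sketch of dispatching conversion by extension name. All names (`register_extension`, `convert`, `_registry`) are hypothetical and not part of nanoarrow's API; this only illustrates how a nested consumer could look up a converter per child field instead of hard-coding each extension type:

```python
# Hypothetical sketch: a registry keyed by extension name.
# None of these names exist in nanoarrow; illustration only.
_registry = {}

def register_extension(name, to_py):
    """Associate a Python-conversion callable with an extension name."""
    _registry[name] = to_py

def convert(extension_name, storage_values):
    """Dispatch on the extension name, falling back to the storage values."""
    converter = _registry.get(extension_name)
    if converter is None:
        # Unregistered extension: behave like the storage type
        return list(storage_values)
    return converter(storage_values)

# A consumer of struct<geoarrow.polygon, uuid> could call convert()
# once per child field without knowing either type in advance.
register_extension("arrow.example", lambda values: [v * 2 for v in values])

convert("arrow.example", [1, 2, 3])  # registered: custom conversion
convert("other.ext", [1, 2, 3])      # unregistered: storage fallback
```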

jorisvandenbossche commented 5 months ago

The part where the concept of a registry is difficult to avoid (global or tightly scoped) is when doing conversion to/from Python objects with nested type support.

The user always has cheap access to the storage type, or the raw data, and can do a custom conversion from that? For the example of polygons, I wouldn't want some iterative approach to convert to Python objects; I would just want access to the underlying coordinates and offsets to do a custom bulk conversion.
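As a rough illustration of the bulk approach (the data and function name here are made up, not a nanoarrow API): given an Arrow-style list layout, i.e. a flat coordinate sequence plus an offsets buffer, each ring can be recovered with one slice rather than by visiting every coordinate through a Python-level iterator:

```python
# Made-up storage for a list<point>-like layout: flat coordinates
# plus offsets, where ring i spans coords[offsets[i]:offsets[i + 1]].
coords = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 2.0)]
offsets = [0, 3, 5]

def rings_from_storage(coords, offsets):
    # One bulk slice per ring instead of per-coordinate iteration
    return [coords[offsets[i]:offsets[i + 1]] for i in range(len(offsets) - 1)]

rings_from_storage(coords, offsets)
```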

I personally would punt on this problem of custom conversion for now, and see what kind of needs arise (we don't even support this in pyarrow at the moment)

If some kind of registry is needed specifically for conversion, the sqlite interface could be an inspiration? (Not necessarily the exact names, because I find those a bit confusing, but the fact that you register just something for conversion from/to Python, and that it also doesn't necessarily need to be limited to extension types. Users might want to influence how, for example, a map or interval type is converted to Python.)

paleolimbot commented 5 months ago

Users might want to influence how for example a map or interval type is converted to python

Right now that's a little awkward, but possible:

import nanoarrow as na
import pyarrow as pa
from nanoarrow import iterator
import decimal

decimals = pa.array([decimal.Decimal("1.234")], pa.decimal128(10, 3))

class CustomIterator(iterator.PyIterator):
    def _decimal_iter(self, offset, length):
        for item in super()._decimal_iter(offset, length):
            yield None if item is None else float(item)

list(CustomIterator.get_iterator(decimals))
#> [1.234]

I personally would punt on this problem of custom conversion for now

Agreed! I think I've scoped it now to the basic usage of if schema.extension and schema.extension.name == ....

(we don't even support this in pyarrow at the moment)

I'm not sure exactly where it is used, but you can add a custom Scalar subclass that overrides as_py()?

paleolimbot commented 5 months ago

Thank you for taking a look!