paleolimbot closed this 5 months ago
Do we want to go the route of a registry and having users define their own?
For nanoarrow, I would personally stick to what it in essence is: metadata (and we have the advantage that here the "data type" (ArrowSchema) of an array actually has this field metadata, so we don't have the problem we have in pyarrow, where the Array object loses that information)
(we could of course still provide more ergonomic access to the extension name/metadata, e.g. by detecting whether those keys are present and, in that case, showing them more prominently in certain reprs, or providing an easier getter property for them, etc.)
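Such a getter could be sketched over a plain metadata mapping like this (the helper name is hypothetical; the two key names are the canonical Arrow extension metadata keys):

```python
# Sketch of an "easier getter" for extension info, operating on a plain
# metadata mapping as exposed by an ArrowSchema. The helper name is
# hypothetical; the two key names are the canonical Arrow ones.
EXT_NAME_KEY = "ARROW:extension:name"
EXT_META_KEY = "ARROW:extension:metadata"

def extension_info(metadata):
    """Return (name, serialized_metadata) if this is an extension type, else None."""
    if metadata and EXT_NAME_KEY in metadata:
        return metadata[EXT_NAME_KEY], metadata.get(EXT_META_KEY, "")
    return None

print(extension_info({EXT_NAME_KEY: "geoarrow.polygon"}))
#> ('geoarrow.polygon', '')
print(extension_info({"some.other.key": "value"}))
#> None
```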
Do we want to go the route of a registry and having users define their own?
Great point! I'm still feeling my way through how users should interact with this.
The part where the concept of a registry is difficult to avoid (global or tightly scoped) is when doing conversion to/from Python objects with nested type support. Without a registry, how would one produce/consume a struct<geoarrow.polygon, uuid> without inefficiently bouncing through deeply nested Python output, or without every consuming package explicitly opting in to every possible extension type?
The part where the concept of a registry is difficult to avoid (global or tightly scoped) is when doing conversion to/from Python objects with nested type support.
The user always has cheap access to the storage type, or the raw data, and can do a custom conversion from that? For the example of polygons, I wouldn't want some iterative approach to convert to Python objects; I would just want access to the underlying coordinates and offsets to do a custom bulk conversion.
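Purely for illustration, that kind of bulk conversion from offsets plus a flat coordinate buffer could look like this (plain Python lists stand in for the underlying Arrow buffers; the helper name is made up):

```python
# Sketch: convert a "list of rings" layout to Python objects in one pass,
# slicing the flat coordinate buffer by offsets rather than iterating
# element by element. Plain lists stand in for Arrow buffers here.
def rings_to_py(offsets, coords):
    """offsets: ring boundaries into coords; coords: flat (x, y) pairs."""
    return [coords[start:end] for start, end in zip(offsets, offsets[1:])]

coords = [(0, 0), (1, 0), (1, 1), (0, 0), (2, 2), (3, 2), (2, 3), (2, 2)]
offsets = [0, 4, 8]
print(rings_to_py(offsets, coords))
#> [[(0, 0), (1, 0), (1, 1), (0, 0)], [(2, 2), (3, 2), (2, 3), (2, 2)]]
```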
I personally would punt on this problem of custom conversion for now, and see what kind of needs arise (we don't even support this in pyarrow at the moment)
If some kind of registry is needed specifically for the conversion, the sqlite interface could be an inspiration? (Not necessarily the exact names, because I find those a bit confusing, but the fact that you register just a callable for conversion from/to Python, and that this doesn't necessarily need to be limited to extension types. Users might want to influence how, for example, a map or interval type is converted to Python.)
Users might want to influence how for example a map or interval type is converted to python)
Right now that's a little awkward, but possible:
import nanoarrow as na
import pyarrow as pa
from nanoarrow import iterator
import decimal

decimals = pa.array([decimal.Decimal("1.234")], pa.decimal128(10, 3))

class CustomIterator(iterator.PyIterator):
    def _decimal_iter(self, offset, length):
        for item in super()._decimal_iter(offset, length):
            yield None if item is None else float(item)

list(CustomIterator.get_iterator(decimals))
#> [1.234]
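If a registry does end up being needed, its shape could be as small as this sketch (register_converter and convert are hypothetical names; only the registration idea is borrowed from the sqlite3 module):

```python
import decimal

# Hypothetical sqlite3-style converter registry: users register a callable
# per type (keyed here by a type name string) that maps each non-null value
# to a Python object during conversion.
_converters = {}

def register_converter(type_name, func):
    _converters[type_name] = func

def convert(type_name, values):
    func = _converters.get(type_name, lambda v: v)
    return [None if v is None else func(v) for v in values]

register_converter("decimal128", float)
print(convert("decimal128", [decimal.Decimal("1.234"), None]))
#> [1.234, None]
```

The same mechanism would apply equally to non-extension types like map or interval, which is the flexibility suggested above.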
I personally would punt on this problem of custom conversion for now
Agreed! I think I've now scoped it to the basic usage of checking if schema.extension and schema.extension.name == ....
(we don't even support this in pyarrow at the moment)
I'm not sure exactly where it is used, but you can add a custom Scalar subclass that overrides as_py()?
Thank you for taking a look!
The initial motivation of this PR was to ensure that extension types are handled in nanoarrow's Schemas; however, this exposed that metadata was not handled by the Schema in either create or consume mode. After this PR, extension types can be created and inspected. This is really just creating a schema with some metadata (and looking for specific metadata when consuming).
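Concretely, "a schema with some metadata" here means attaching the two canonical keys to the storage schema; e.g., for a hypothetical UUID extension over fixed-size binary:

```python
# Creating an extension type is just a storage schema plus the canonical
# metadata keys. The extension name below is illustrative, not a real
# registered extension.
storage_format = "w:16"  # fixed-size binary(16) in Arrow format-string terms
metadata = {
    "ARROW:extension:name": "example.uuid",
    "ARROW:extension:metadata": "",  # optional serialized type parameters
}
```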
In doing some testing, there were a number of places where extension schemas/arrays were implicitly treated as their storage types. I've tried to error or warn in these cases as much as possible:
It's a little hard to create an extension array at the moment (and there should probably be a similar option to strip the extension type from an Array to just get the storage), but I think that is maybe a job for another PR that is more about arrays and less about schemas.

A side effect of all of this is that there is better support for modifying schemas: