Add an ExtensionType to DataType enum

wjones127 commented 1 year ago

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

Extension types are annotated in field metadata. This works well with record batches, but when exporting/importing an array over the C data interface, the extension type metadata is lost.

The C++ implementation solves this by having an ExtensionType class and always exports that metadata over C data interface here:

https://github.com/apache/arrow/blob/b9aec9ad2b655817b8925462e4e2dd6973807e23/cpp/src/arrow/c/bridge.cc#L243-L252

Describe the solution you'd like

I'd propose adding a new enum variant to DataType:

struct ExtensionType {
   name: String,
   metadata: String,
   storage_type: Box<DataType>,
}

enum DataType {
    ...
    ExtensionType(ExtensionType)
}

Then make sure the C data interface implementation handles exporting and importing this type.

Describe alternatives you've considered

We could add an extension type registry like C++, but that seems heavier than we really need.

Additional context

https://arrow.apache.org/docs/format/CDataInterface.html#extension-arrays

Previous discussions:

2444
218

tustvold commented 1 year ago

but when exporting/importing an array over the C data interface, the extension type metadata is lost

I might be missing something here, but why would it be lost, schema metadata should roundtrip over C data interface?

I'd propose adding a new enum variant to DataType:

My major objection to this approach is that it undermines the transparency of extension types. I feel quite strongly that only codepaths explicitly concerned with extension types should need concern themselves with them, for example the take or arithmetic kernel should not need to know about extension types. However, adding a DataType::Extension would instead force kernels to "see through" DataType::Extension when downcasting or performing operations such as extracting decimal precision. I would much prefer an approach that does not require this, by instead propagating extension metadata out-of-band.

wjones127 commented 1 year ago

I might be missing something here, but why would it be lost, schema metadata should roundtrip over C data interface?

This works well for RecordBatch, but not for an individual array transported independent from any batch. Basically, arrays themselves have no way to be tagged as an extension array, since those don't contain a field where that metadata is stored; they are only extension arrays in the context of a batch.

I feel quite strongly that only codepaths explicitly concerned with extension types should need concern themselves with them, for example the take or arithmetic kernel should not need to know about extension types.

I definitely agree, and don't want to make these operations more complex than they ought to be.

If we can think of another place to put this information, I'm open to that.

(A bit of a tangent, but...) In my ideal world, there would be a logical type enum and a physical type enum. Physical types would be the current DataType. Then logical types would be things like String (just one, regardless of offset size and encoding) and then a generic ExtensionType variant. Sort of like what Sasha was talking about a long time ago: https://lists.apache.org/thread/357z4587dczho4x1257ttf0b4o9302co

wjones127 commented 1 year ago

FWIW my workaround for now is to just wrap it in a record batch and unwrap on the other side. But it would be nice to find a way to not have to do that.

https://github.com/lancedb/lance/blob/3e1ed67acf7d1de336b6d211647d9581c64c3bed/python/src/arrow.rs#L53-L67

tustvold commented 1 year ago

If we can think of another place to put this information, I'm open to that.

I personally think Field is the correct location for such metadata, imo DataType should only contain the information for kernels to interpret the physical array data. Concerns about nullability, specialized kernels for extension types, etc... I think should be handled at a higher level. Whilst it isn't a true logical vs physical separation, I think that is a helpful way to conceptualize it.

But it would be nice to find a way to not have to do that.

This feels like a limitation of the way the python conversion is implemented

impl ToPyArrow for ArrayData {
    fn to_pyarrow(&self, py: Python) -> PyResult<PyObject> {
        let array = FFI_ArrowArray::new(self);
        let schema = FFI_ArrowSchema::try_from(self.data_type()).map_err(to_py_err)?;

        let module = py.import("pyarrow")?;
        let class = module.getattr("Array")?;
        let array = class.call_method1(
            "_import_from_c",
            (
                addr_of!(array) as Py_uintptr_t,
                addr_of!(schema) as Py_uintptr_t,
            ),
        )?;
        Ok(array.to_object(py))
    }
}

In particular the schema is inferred from the array's data type. If instead there were a way to provide a Field then this would allow propagating not only extension metadata, but also nullability, dictionary ordering, etc... It would also potentially allow performing the schema conversion once and using the result for multiple arrays.

wjones127 commented 11 months ago

I personally think Field is the correct location for such metadata, imo DataType should only contain the information for kernels to interpret the physical array data. Concerns about nullability, specialized kernels for extension types, etc... I think should be handled at a higher level.

I think that could be a decent approach, although I'm still trying to understand what that would look like. It sounds like the arrow-rs type system is closed, but can be wrapped in a higher-level type system. (whereas, the C++ kernels are extension-aware).

So it sounds like the point to add extension types is when building extensions in datafusion. Eventually, I think it would be nice to have an example in arrow-datafusion showing how to add support for a simple extension type (such as UUID) in the engine. Basically the end result would be showing that

SELECT gen_random_uuid()

outputs

+--------------------------------------+
| gen_random_uuid()                    |
+--------------------------------------+
| eeccb8c5-9943-b2bb-bb5e-222f4e14b687 |
+--------------------------------------+

Showing that you can add methods that output extension types (gen_random_uuid()) and that you can control how those extension types are displayed.

yukkit commented 10 months ago

I encountered the same trouble when trying to add a custom type. Although the extension type can be marked through the metadata of FIeld, the metadata will be lost in the scene of array processing, for example: udf in datafusion

kylebarron commented 8 months ago

I'm taking a stab at migrating my geoarrow-rs crate (which implements the GeoArrow extension array spec) from arrow2 to arrow-rs, and wanted to add that I'm feeling the pain of omission of a DataType::Extension variant in arrow-rs.

In particular, a geospatial algorithm would have to return a Field with every operation, because the physical layout of a LineStringArray is exactly the same as that of a MultiPointArray (and PolygonArray/MultiLineStringArray). Maybe this is nitpicking, but it I've liked the level of abstraction of having the extension metadata on the DataType, because the operations on the array are separate from a named column in a table.

Edit: If I'm understanding correctly, it's also impossible to implement

impl TryFrom<&dyn Array> for GeometryArray

like I could in arrow2, because dyn Array never has any extension type information, so I wouldn't be able to know what type of geometries the array is holding...

alamb commented 8 months ago

@yukkit is contemplating User Defined Types in DataFusion, and the arrow extension type mechanism is the obvious implementation I think -- see https://github.com/apache/arrow-datafusion/issues/7923

I personally think Field is the correct location for such metadata,

@tustvold are you proposing something like the following?

enum DataType { 
....
   List(FieldRef),
   /// Extension type, with potentially embedded metadata in the field reference
   Extension(FieldRef)
}

This proposal runs afoul of how DataType::List works today (where the field name is mostly irrelevant ("item")), but I don't really have any better ideas.

I think this structure would allow @kylebarron to implement

impl TryFrom<&dyn Array> for GeometryArray {
  fn try_from(arr: &dyn Array>) -> Result<Self> {
    match arr.data_type() {
      DataType::Extension(field) if is_geo_type(field.metadata()) => {
        .... do the conversion ...
      }
      dt => Err("Unsupported datatype: {dt}
   }
}

tustvold commented 8 months ago

are you proposing something like the following?

No I'm proposing not making changes to DataType and using the Field metadata that already exists. This way we avoid conflating physical and logical type information. This is the same mechanism we use in IOx to encode the logical notions of tag vs field columns

alamb commented 8 months ago

No I'm proposing not making changes to DataType and using the Field metadata that already exists. This way we avoid conflating physical and logical type information.

So how would we implement @kylebarron 's use case? Perhaps via a RecordBatch (with a single column)?

kylebarron commented 8 months ago

My interpretation of this is that it's a "zero-sum" architecture decision, in the sense that if you don't want to conflate logical and physical types in the DataType enum, then there's intentionally no way to implement From on &dyn Array; instead it's only possible to implement it on (&dyn Array, &FieldRef)

tustvold commented 8 months ago

You would need whatever performs the kernel selection to have the Field, most likely via Schema, I'm not sure you necessarily need this information simultaneously with the Array?

For example, the DF PhysicalExpr could have already extracted the necessary metadata at plan time (although I think it has access to the schema anyway).

wjones127 commented 8 months ago

@tustvold do you think RecordBatch is something that end-users should be seeing? Or do you imagine this should be a hidden implementation detail in all cases?

If it's the latter, I think I can understand the position to keep extension type separate. But if it's the former, it's hard to see how we can provide a decent UX without bringing the extension type into the array itself. For example, if we return a RecordBatches with a UUID, a user might reasonable surprised that the UUID column prints as the raw bytes and not a hyphenated UUID string.

tustvold commented 8 months ago

RecordBatch/Schema is but one way that users might choose to expose logical type information, they might also define their own array abstractions that wrap arrow arrays, or their own schema abstraction at plan time, etc...

As @kylebarron rightly states it's a zero-sum API challenge, either all of arrow must become aware of extension types, or it is confined to the areas that actually care. It seems odd to me to optimize the design here for things that are not present in the specification at the expense of everything else, further it seems unfortunate to optimise for one particular way of encoding logical type information.

I don't have all the answers here, I don't know what a general purpose logical type abstraction looks like, if such a thing even exists, but it does seem that the core library shouldn't be opinionated in this regard

kylebarron commented 7 months ago

This feels like a limitation of the way the python conversion is implemented

For Python conversion specifically, this might be solved with the new PyCapsule interface (ref #5067) because the ArrowSchema FFI struct is generated by pyarrow itself, and so it doesn't have to be inferred from array.type. (I haven't verified how it works with extension arrays yet)

alamb commented 7 months ago

FYI @yukkit has created a PR showing how LogicalTypes might work in DataFusion: https://github.com/apache/arrow-datafusion/pull/8143. It is a pretty neat idea.

apache / arrow-rs

Add an ExtensionType to DataType enum #4472

2444

218