apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0

Extension Types #12644

Open findepi opened 4 days ago

findepi commented 4 days ago

Is your feature request related to a problem or challenge?

Currently, DataFusion provides many built-in types, which are useful when building applications and query engines on top of DataFusion. However, even a plethora of types is not enough. DataFusion lacks types that exist in other systems, which limits DataFusion's applicability as an "LLVM for query engines".

For example, these types commonly found in other systems do not exist today

Describe the solution you'd like

  1. Introduction of DataFusion's own type system
  2. Introduction of extensions in the DataFusion type system, allowing applications building on DataFusion to provide more types
    • the extension types -- not unlike DataFusion built-in types -- need to use Arrow types as "carrier types" for transporting data
    • the Arrow type metadata woven into schema fields can be used to indicate the use of extension types to the client when data is returned to the user in Arrow form
    • for example, a "timestamp with time zone" type could be represented as Struct with two fields: point_in_time, time_zone
  3. Ability to dynamically find operations on types during function resolution or runtime
    • for example a CAST(array<T> AS varchar) needs to know how to do cast(T AS varchar). It cannot delegate this logic fully to Arrow, because Arrow won't have a notion of extension types.
      • e.g., if "timestamp with time zone" uses a Struct as a carrier type, it still needs to define its own cast(... AS varchar). It cannot use the default cast(struct AS varchar).
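
To make points 2 and 3 more concrete, here is a minimal, std-only Rust sketch. It is not DataFusion or arrow-rs code: `Carrier`, `Value`, `LogicalType`, `ExtensionType`, `TypeRegistry`, the `timestamp_tz` name, and the rendered string format are all hypothetical stand-ins invented for illustration. The only piece taken from Arrow is the `ARROW:extension:name` metadata key. The sketch assumes a registry keyed by extension name, and shows an array cast looking up the element cast instead of falling back to a default struct cast.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for an Arrow physical ("carrier") type.
#[derive(Debug, Clone, PartialEq)]
enum Carrier {
    Int64,
    Utf8,
    Struct(Vec<(String, Carrier)>),
    List(Box<Carrier>),
}

/// Hypothetical stand-in for a value stored in a carrier type.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Int64(i64),
    Utf8(String),
    Struct(Vec<Value>),
    List(Vec<Value>),
}

/// Logical types: built-ins plus extension types referenced by name.
#[derive(Debug, Clone, PartialEq)]
enum LogicalType {
    Int64,
    Varchar,
    Array(Box<LogicalType>),
    Extension(String),
}

/// What an application would register for each extension type (assumed shape).
struct ExtensionType {
    carrier: Carrier,
    cast_to_varchar: fn(&Value) -> Option<String>,
}

#[derive(Default)]
struct TypeRegistry {
    extensions: HashMap<String, ExtensionType>,
}

impl TypeRegistry {
    /// Carrier type used to transport values of `ty`.
    fn carrier(&self, ty: &LogicalType) -> Option<Carrier> {
        match ty {
            LogicalType::Int64 => Some(Carrier::Int64),
            LogicalType::Varchar => Some(Carrier::Utf8),
            LogicalType::Array(elem) => Some(Carrier::List(Box::new(self.carrier(elem)?))),
            LogicalType::Extension(name) => Some(self.extensions.get(name)?.carrier.clone()),
        }
    }

    /// Schema-field metadata that announces an extension type to the client.
    fn field_metadata(&self, ty: &LogicalType) -> HashMap<String, String> {
        let mut metadata = HashMap::new();
        if let LogicalType::Extension(name) = ty {
            metadata.insert("ARROW:extension:name".to_string(), name.clone());
        }
        metadata
    }

    /// cast(T AS varchar), resolved per logical type rather than per carrier type.
    fn cast_to_varchar(&self, ty: &LogicalType, value: &Value) -> Option<String> {
        match (ty, value) {
            (LogicalType::Int64, Value::Int64(i)) => Some(i.to_string()),
            (LogicalType::Varchar, Value::Utf8(s)) => Some(s.clone()),
            // cast(array<T> AS varchar) delegates to cast(T AS varchar) per element.
            (LogicalType::Array(elem), Value::List(items)) => {
                let parts: Option<Vec<String>> =
                    items.iter().map(|item| self.cast_to_varchar(elem, item)).collect();
                Some(format!("[{}]", parts?.join(", ")))
            }
            // Extension types supply their own cast; the struct default is never used.
            (LogicalType::Extension(name), v) => (self.extensions.get(name)?.cast_to_varchar)(v),
            _ => None,
        }
    }
}

/// Hypothetical rendering of the (point_in_time, time_zone) struct.
fn timestamp_tz_to_varchar(value: &Value) -> Option<String> {
    match value {
        Value::Struct(fields) => match fields.as_slice() {
            [Value::Int64(millis), Value::Utf8(zone)] => Some(format!("{}ms {}", millis, zone)),
            _ => None,
        },
        _ => None,
    }
}

/// Registry with one extension type, carried as Struct(point_in_time, time_zone).
fn example_registry() -> TypeRegistry {
    let mut registry = TypeRegistry::default();
    let carrier = Carrier::Struct(vec![
        ("point_in_time".to_string(), Carrier::Int64),
        ("time_zone".to_string(), Carrier::Utf8),
    ]);
    let ext = ExtensionType { carrier, cast_to_varchar: timestamp_tz_to_varchar };
    registry.extensions.insert("timestamp_tz".to_string(), ext);
    registry
}
```

With `example_registry()`, casting an `Array(Extension("timestamp_tz"))` value goes through `timestamp_tz_to_varchar` for each element, while an unregistered extension name yields `None`. A real design would also have to cover function resolution, nullability, and vectorized execution, which this sketch ignores.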

Describe alternatives you've considered

Everything is built-in

DataFusion could provide all types needed by applications building on top of DataFusion as built-in DataFusion types. This would be the easiest to implement, but could lead to scope creep for the project. It could also lead to conflicts where types look the same but the desired behavior differs between applications building on top of DataFusion. For example, Oracle's and Trino's "timestamp with time zone" can represent political zones, while Snowflake's allows only fixed offsets.

No-op

Not providing extension types. This would limit DataFusion's applicability. DataFusion cannot be considered an "LLVM for query engines" if it cannot serve as an engine, or potential engine, for existing popular query engines.

Additional context

The need to create extension types was raised in the [Proposal] Decouple logical from physical types

However, introducing DataFusion's own types does not require introducing extension types. Extension types are complex enough (especially given their impact on functions) that they deserve their own roadmap issue.

Extension types clearly affect functions, both at runtime and during resolution, so this relates to the Simple Functions initiative:

Having ExtensionType in arrow-rs could make the implementation simpler:

findepi commented 4 days ago

cc @alamb, @andygrove, @jayzhan211, @ozankabak, @notfilippo, @comphead, @kylebarron, @yukkit, @sunchao, @Folyd, @wjones127, @Xuanwo, @sadboy, @milevin

kylebarron commented 3 days ago

I'm not very knowledgeable about DataFusion internals or database theory, so it's hard for me to provide feedback on the proposal, but I'm very excited about the prospect of extension types to enable spatial types (https://github.com/apache/datafusion/issues/7859). I've been collaborating on the GeoArrow spec, which defines Arrow extension types for spatial data. It's important to have additional logical types because the same physical layout can be interpreted in multiple logical ways (e.g. an array of LineString and MultiPoint), and to store coordinate reference system information (what physical locations on earth these numbers represent) as part of the type. I'm happy to provide more motivating examples if that would help!

findepi commented 2 days ago

FYI, I touched on the topic of types at the DataFusion meetup in Belgrade yesterday. The slides are here if anyone is interested: https://docs.google.com/presentation/d/1VW_JCGbN22lrGUOMRvUXGpAmlJopbG02hn_SDYJouiY . It was an attempt to summarize why we need all of: simpler types (https://github.com/apache/datafusion/issues/11513), more types (https://github.com/apache/datafusion/issues/12644), and a simple function "SDK" (https://github.com/apache/datafusion/issues/12635). The document has comments disabled to keep the discussion in the GitHub issue.

alamb commented 1 day ago

See a previous proposal from @yukkit : https://github.com/apache/datafusion/issues/7923