apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
5.48k stars 1.01k forks source link

[Proposal] Support User-Defined Types (UDT) #7923

Open yukkit opened 8 months ago

yukkit commented 8 months ago

Is your feature request related to a problem or challenge?

I've noticed there are some issues regarding adding extension types in DataFusion.

Providing an interface for adding extension types in DataFusion would be highly meaningful. This would allow applications built on DataFusion to easily incorporate business-specific data types.

I hope to promote the development of the UDT feature through this current proposal.

Describe the solution you'd like

User-Defined Types (UDT)

UDT stands for User-Defined Type. It is a feature in database systems that allows users to define their own custom data types based on existing data types provided by the database. This feature enables users to create data structures tailored to their specific needs, providing a higher level of abstraction and organization for complex data.

Syntax

<user-defined type definition> ::=
  CREATE TYPE <user-defined type name> AS <representation>

<representation> ::=
  <predefined type>
| <member list>

<member list> ::=
  <left paren> <member> [ { <comma> <member> }... ] <right paren>

<member> ::=
  <attribute name> <data type>

<attribute name> ::=
  <identifier>

Behaviors

Behaviors of Data Types

  1. Type matching assessment.
  2. Computation of the common super type for two types.

Behaviors of Data

  1. Inference of data type from literal value.
  2. Casting literal value to other type.
  3. Casting variable value to other type.
  4. Import and export of data. (Sensitive to logical data types)
  5. Operations like data comparison, etc.

Role of Data Types in the SQL Lifecycle

SQL Statement String -> AST

None

AST -> Logical Plan

  1. Create Type
    • Parsing, constructing, and storing the description of UDT.
  2. Create Table (Using Type)
    • Parsing data types
      • Built-in types
      • User-defined types
    • Constructing DFField (using metadata field to tag extended types), storing metadata.
  3. Query
    • How to construct extended data types?
      • Use the STRUCT function.
      • Use UDF.
    • How to perform relationship (comparison) operations, logical operations, arithmetic operations with other data types? How to perform type conversion?
      • Constant to UDT
        1. Use arrow conversion rules.
      • Variable to UDT
        1. Judge if cast can be performed according to arrow rules, and add cast expression as needed.
      • UDT to other data types
        1. Judge if cast can be performed according to arrow rules, and add cast expression as needed.

          e.g. Any binary to UUID (DataType::FixedSizeBinary(16)), if data layout is the same but data content format is different, conversion is not possible. But from my understanding, UDT is not related to data content, only to data type, so this is not a problem.

    • Hashing, sorting?
      • Use arrow DataType.

Logical Plan -> Execution Plan

None

Execution Plan -> ResultSet

  1. Cast
    • Execute according to arrow DataType's cast logic.
  2. Comparison, operations, etc.
    • Execute according to arrow DataType's logic.
  3. TableScan/TableWrite
    • Identify extended types through Field metadata, thus performing special serialization or deserialization.

Core Structures

/// UDT Signature
/// <udt_name>[ (<param>[ {,<param>}... ]) ]
pub struct TypeSignature<'a'> {
  name: Cow<'a, str>,
  params: Vec<Cow<'a, str>>,
}

/// UDT Entity
pub struct UserDefinedType {
  signature: TypeSignature,
  physical_type: DataType,
}

impl UserDefinedType {
  /// Physical data type
  pub fn arrow_type(&self) -> DataType;
  /// Metadata used to tag extended data types
  pub fn metadata(&self) -> HashMap<String, String>;
}

pub trait ContextProvider { 
  /// Get UDT description by signature
  fn udt(&self, type_signature: TypeSignature) -> Result<Arc<UserDefinedType>>;

  ......
}

Examples

create udt

CREATE TYPE user_id_t AS BIGINT;
CREATE TYPE email_t AS String;
CREATE TYPE person_t AS (
  user_id user_id_t,
  first_name String,
  last_name String,
  age INTEGER,
  email email_t);

DROP TYPE person_t;
DROP TYPE email_t;
DROP TYPE user_id_t;

geoarrow

https://github.com/geoarrow/geoarrow/blob/main/extension-types.md

Point

type_signature: Geometry(Point)
arrow_type: DataType::FixedSizeList(xy, 2)
metadata: { "ARROW:extension:name": "geoarrow.point" }

Questions

  1. Is the UDF sensitive to extended types (e.g., encoding of extended type data in binary, where type tagging exists only in Field metadata and cannot be obtained during UDF runtime)?

Describe alternatives you've considered

No response

Additional context

@alamb I am particularly eager to receive your feedback or suggestions on this proposal. Additionally, I highly encourage individuals who are familiar with or interested in this feature to contribute their improvement ideas.

alamb commented 8 months ago

Thank you for this @yukkit -- I think the high level idea would work really nicely into the DataFusion story of extensibility.

I think the core challenge of implementing this feature is how to work it into the existing code. DataFusion uses DataType directly all over its code base, and I think it is close to infeasible now to try and change that.

One way to model user defined types in DataFusion would be as an arrow extension type (which would need upstream support as described in https://github.com/apache/arrow-rs/issues/4472). Then the DataFusion codebase could treat all user defined types as arrow extension types, using the UserDefinedType metadata to look up the various information needed for planning and execution. We would have to extend the various codepaths to know about and handle extension types.

There is also a somewhat related discussion on https://github.com/apache/arrow-datafusion/discussions/7421 about how DataType encodes both encoding and logical type

yukkit commented 8 months ago

@alamb I read the discussion on data types in #7421. I understand that there is no concept of logical types in arrow-rs. Rather than adding an ExtensionType in arrow DataType, I am more inclined to introduce a LogicalType in DF, just as discussed in #7421.

My thoughts are as follows:

  1. Under the current arrow-rs architecture, adding logical types is a relatively difficult and labor-intensive task. Additionally, there is uncertainty about the actual benefits. I believe this is one of the reasons why @tustvold approaches this matter with caution.

  2. Introducing a LogicalType in DF would have a relatively small impact. The LogicalType only comes into play during the planning phase and is not significantly different from the existing logic (based on arrow datatype).

  3. It would be easier to add extension types. However, it is important to clearly define the behavior and scope of LogicalType and PhysicalType.

Finally, in my opinion, both of the following plans are feasible:

  1. Add ExtensionType in arrow DataType
  2. Add LogicalType in DF

I look forward to hearing more suggestions on which plan to ultimately implement

dojiong commented 8 months ago

I think we should support ExtensionType for TableProvider(define a column with some extension type) whatever the plan we choose finally.

  1. Add LogicalType in DF

TableProvider returns Schema defined in arrow, so we need another Schema type with LogicalType in DF?

yukkit commented 8 months ago

TableProvider returns Schema defined in arrow, so we need another Schema type with LogicalType in DF?

I think that TableProvider need to return schema with LogicalType Instead of the schema containing arrow data type.

alamb commented 8 months ago

Introducing a LogicalType in DF would have a relatively small impact. The LogicalType only comes into play during the planning phase and is not significantly different from the existing logic (based on arrow datatype).

I am not sure about how large the change is, but my intuition is that it would be substantial. However, the only way to find out I think would be to try.

Finally, in my opinion, both of the following plans are feasible: Add ExtensionType in arrow DataType Add LogicalType in DF

I agree they are both feasible, though they come with different tradeoffs.

I think the next step is probably to prototype one of the approaches with a technical spike (aka make a PR with that it would look like, to get a sense of the API as well as what would need to be changed)

yukkit commented 8 months ago

@alamb I very much agree with you, I will try it next, and if it proves feasible, I will submit draft pr step by step.