apache / druid

Apache Druid: a high performance real-time analytics database.
https://druid.apache.org/
Apache License 2.0

druid-deltalake-extensions support for StructType #16782

Closed (Donutellko closed this issue 2 months ago)

Donutellko commented 2 months ago

Description

Hello, we are trying to load data from a Delta table and are running into an issue with StructType. The error message below is manually formatted; the full stack trace is in the attachment druid-delta-unsupported-StructType.log:

Failed to sample data: Unsupported data type[
  struct(
    StructField(name=FieldOne,type=string,nullable=true,metadata={}), 
    StructField(name=FieldTwo,type=string,nullable=true,metadata={})
  )
] for fieldName[MetaData].
        at org.apache.druid.error.DruidException$DruidExceptionBuilder.build(DruidException.java:460)
        at ...
        at org.apache.druid.error.InvalidInput.exception(InvalidInput.java:30)
        at org.apache.druid.delta.input.DeltaInputRow.getValue(DeltaInputRow.java:201)
        at org.apache.druid.delta.input.DeltaInputRow._getRaw(DeltaInputRow.java:163)
        at org.apache.druid.delta.input.DeltaInputRow.<init>(DeltaInputRow.java:74)
        at org.apache.druid.delta.input.DeltaInputSourceReader$DeltaInputSourceIterator.next(DeltaInputSourceReader.java:140)
        at ...

Using apache/druid:30.0.0

Expected behavior:

Motivation

Donutellko commented 2 months ago

Feature initially introduced in #15755.

@abhishekrb19, I appreciate your work a lot and would appreciate your response even more.

abhishekrb19 commented 2 months ago

Hi @Donutellko, thanks for reporting. IIRC, struct and array types weren't fully supported by the upstream Delta Kernel library in 3.0.0, when the extension was originally written. Now that we use Kernel 3.2.0, it seems that support has been added. I will look into adding it to the Druid connector.

Re the expected behavior you note:

Expected behavior:

  • StructType's StructFields are loaded as a set of columns with a common prefix: MetaData.FieldOne, MetaData.FieldTwo, ...;
  • or (at least) StructType is loaded as a JSON string.
  • Additionally, I would like to discuss the possibility of loading Delta ArrayType as a JSON string.
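
To make the options in the list above concrete, here is a minimal Python sketch of the two shapes a struct column could take after loading: flattened into prefixed columns, or kept as a single JSON string (which also covers the array case). The row values and the helper names `flatten_struct` and `struct_as_json` are hypothetical, used only for illustration; this is not how the Druid extension is implemented.

```python
import json

# Hypothetical row mirroring the struct from the error message; the values are made up.
row = {"MetaData": {"FieldOne": "a", "FieldTwo": "b"}}


def flatten_struct(row, sep="."):
    """Option 1: expand each struct field into its own column with a common prefix."""
    flat = {}
    for name, value in row.items():
        if isinstance(value, dict):
            # Recurse so that nested structs are flattened as well.
            for child, child_value in flatten_struct(value, sep).items():
                flat[f"{name}{sep}{child}"] = child_value
        else:
            flat[name] = value
    return flat


def struct_as_json(row):
    """Option 2: keep one column per struct or array, serialized as a JSON string."""
    return {
        name: json.dumps(value) if isinstance(value, (dict, list)) else value
        for name, value in row.items()
    }


flat_row = flatten_struct(row)   # columns MetaData.FieldOne and MetaData.FieldTwo
json_row = struct_as_json(row)   # a single MetaData column holding a JSON string
```

Flattening gives directly queryable columns but fixes the schema at load time, while the JSON form keeps the struct intact and defers extraction.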

I think the Delta input source should just write structs as JSON and arrays as arrays. For structs, if a user wants to flatten the fields or extract/transform specific fields present in the struct, it should still be possible to do so using the SQL JSON functions, which can be used at ingest and/or query time. For example, JSON_VALUE("MetaData", '$.FieldOne') AS "MetaData.FieldOne" will extract FieldOne as a separate column, which you can name however you like.
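
As a rough illustration of the extraction step described above, the following Python sketch mimics what a JSON_VALUE-style lookup with a simple '$.Field' path does to a nested struct. The `json_value` helper, the row values, and the `Tags` column are hypothetical; the helper only handles dotted key paths and is an assumption about the general behavior, not Druid's implementation.

```python
# Hypothetical ingested row: the struct is kept nested and the array stays an array.
row = {"MetaData": {"FieldOne": "a", "FieldTwo": "b"}, "Tags": ["x", "y"]}


def json_value(value, path):
    """Follow a simple '$.a.b' path and return the scalar found there, else None."""
    if not path.startswith("$"):
        raise ValueError("path must start with '$'")
    current = value
    for key in path[1:].split("."):
        if key == "":
            continue  # skip the empty segment produced by the leading '$.'
        if not isinstance(current, dict) or key not in current:
            return None  # missing field
        current = current[key]
    # Only scalar values are returned; nested objects and arrays yield None here.
    return None if isinstance(current, (dict, list)) else current


# Rough equivalent of: JSON_VALUE("MetaData", '$.FieldOne') AS "MetaData.FieldOne"
extracted = {"MetaData.FieldOne": json_value(row["MetaData"], "$.FieldOne")}
```

Because the alias is chosen by the user, the same lookup can produce the prefixed column names from the original request without the input source having to flatten anything.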

Does that sound good to you?

Donutellko commented 2 months ago

Thank you for your response @abhishekrb19.

I think the Delta input source should just write structs as JSON and arrays as arrays. <...> Does that sound good to you?

Yes, that sounds good. Could you provide any ETA for the implementation?

abhishekrb19 commented 2 months ago

@Donutellko the fix was merged in https://github.com/apache/druid/pull/16884. It will be available in the next release, Druid 31.0.0.