apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0

[EPIC] Improved support for nested / structured types (`Struct`, `List`, `ListArray`, and other Composite types) #2326

Open alamb opened 2 years ago

alamb commented 2 years ago

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

This ticket is designed to capture the work needed to properly support Arrow Struct types in DataFusion.

https://arrow.apache.org/datafusion/user-guide/sql/sql_status.html says that nested types are not supported. They are not fully supported, but parts of the support are already present, such as a way to serialize them via ArrowWriter and the field["nested_field"] access syntax.

Describe the solution you'd like

Research and describe / implement what else remains for proper support.

Array (ListArray) support:

Map (MapArray) support:

Struct (StructArray) support:

Union (UnionArray) support:

Other

Known issues so far:

nl5887 commented 2 years ago

This check, https://github.com/apache/arrow-datafusion/blob/master/datafusion/core/src/physical_plan/file_format/mod.rs#L238, is one cause of the errors related to column projection. It compares the complete enum, so it fails when the field order differs.

Arrow has a method to compare data types (https://github.com/apache/arrow-rs/blob/master/arrow/src/datatypes/datatype.rs#L674). I think this method should be made public and used in the check above.

Currently DataFusion uses match_field_names (default true), see https://github.com/apache/arrow-rs/blob/master/arrow/src/record_batch.rs#L153, which causes the error.
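To illustrate the general idea in the comment above, here is a minimal sketch contrasting strict equality on a nested type with a relaxed structural comparison. `ToyType`, `ToyField`, `list_of`, and `relaxed_eq` are hypothetical names made up for this example; they are not the arrow-rs `DataType` API or the method linked above. Which differences a relaxed comparison should tolerate (names, order, metadata) is a design decision; this sketch only ignores nested field names.

```rust
/// Toy stand-in for a nested data type (hypothetical, NOT arrow-rs's `DataType`).
#[derive(Debug, Clone, PartialEq)]
enum ToyType {
    Int64,
    Utf8,
    List(Box<ToyField>),
    Struct(Vec<ToyField>),
}

/// Toy stand-in for a named, nullable field (hypothetical).
#[derive(Debug, Clone, PartialEq)]
struct ToyField {
    name: String,
    data_type: ToyType,
    nullable: bool,
}

impl ToyField {
    fn new(name: &str, data_type: ToyType, nullable: bool) -> Self {
        Self { name: name.to_string(), data_type, nullable }
    }
}

/// Builds a list of nullable Int64 whose inner field carries the given name.
fn list_of(item_name: &str) -> ToyType {
    ToyType::List(Box::new(ToyField::new(item_name, ToyType::Int64, true)))
}

/// Relaxed comparison: same shape and nullability, nested field names ignored.
fn relaxed_eq(a: &ToyType, b: &ToyType) -> bool {
    match (a, b) {
        (ToyType::List(x), ToyType::List(y)) => {
            x.nullable == y.nullable && relaxed_eq(&x.data_type, &y.data_type)
        }
        (ToyType::Struct(xs), ToyType::Struct(ys)) => {
            // Children are compared positionally; only their names are ignored.
            xs.len() == ys.len()
                && xs.iter().zip(ys).all(|(x, y)| {
                    x.nullable == y.nullable && relaxed_eq(&x.data_type, &y.data_type)
                })
        }
        // Leaf types and mismatched variants fall back to plain equality.
        _ => a == b,
    }
}
```

With these toy types, `list_of("item")` and `list_of("element")` are unequal under the derived `==` because the inner field names differ, while `relaxed_eq` treats them as the same type. A check that relies on the derived equality rejects such a pair even though the data layout is identical.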

alamb commented 2 years ago

Thanks for the investigation @nl5887 -- that definitely sounds plausible. Feel free to file a PR with proposed changes -- we would love to review them

nl5887 commented 2 years ago

This one is also related: https://github.com/apache/arrow-datafusion/issues/2581

tv42 commented 1 year ago

Reminder to write docs: #1222

alexwilcoxson-rel commented 10 months ago

Potential to add to list #7012

alamb commented 3 months ago

We are starting to make progress on struct support --

There is a PR up to support named_struct https://github.com/apache/arrow-datafusion/pull/9743 and work afoot to support nicer literal syntax: https://github.com/apache/arrow-datafusion/issues/9820 🚀

toaiduongdh commented 2 months ago

Hi, I think unnest support for struct can be an item in this epic, right?

alamb commented 2 months ago

> Hi, I think unnest support for struct can be an item in this epic, right?

That would make sense to me -- is there a ticket that describes what this means?

duongcongtoai commented 2 months ago

I created a ticket: https://github.com/apache/datafusion/issues/10264

alamb commented 2 months ago

> I created a ticket: #10264

Thank you. I added this to the list in the ticket description

duongcongtoai commented 1 month ago

I added an issue to support recursive unnest: https://github.com/apache/datafusion/issues/10660. I think it should belong to this epic

alamb commented 1 month ago

> I added an issue to support recursive unnest: #10660. I think it should belong to this epic

Added