apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
5.47k stars 1.01k forks source link

Support comparison operators on nested data types (Struct, List, ..) #10856

Open Blizzara opened 2 weeks ago

Blizzara commented 2 weeks ago

Is your feature request related to a problem or challenge?

We're working on running some used-to-be-Spark pipelines through DataFusion. One case we've noticed where DataFusion doesn't support something is comparing lists. (Spark allows)[https://github.com/apache/spark/blame/d9394eee5ebbeb695baaec6122da2ed970842dfd/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala#L1025] comparing (==, !=, <, >, <=, >=, ..) columns of structs and lists, while in DataFusion those seem to throw:

For structs, from our internal testing:

ArrowError(InvalidArgumentError("Invalid comparison operation: Struct([Field { name: \"a\", data_type: Int32, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: \"b\", data_type: Int32, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }]) <= Struct([Field { name: \"a\", data_type: Int32, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: \"b\", data_type: Int32, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }])"), None)

For lists, this is shown in DataFusion's tests: https://github.com/apache/datafusion/blob/3773fb7fb54419f889e7d18b73e9eb48069eb08e/datafusion/sqllogictest/test_files/array_query.slt#L44

Maybe this would need to be improved on Arrow directly, seeing that the error is coming from https://github.com/apache/arrow-rs/blob/087f34b70e97ee85e1a54b3c45c5ed814f500b0a/arrow-ord/src/cmp.rs#L219?

Describe the solution you'd like

Binary predicates to be allowed for structs and lists, preferably following same semantics as in Spark (mostly I think it's a DFS over all the fields https://github.com/apache/spark/blob/d9394eee5ebbeb695baaec6122da2ed970842dfd/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/types/PhysicalDataType.scala#L285)

Describe alternatives you've considered

No response

Additional context

Related to https://github.com/apache/datafusion/issues/2326

Blizzara commented 2 weeks ago

Ah, https://github.com/apache/datafusion/issues/9254 is related, as is https://github.com/apache/arrow-rs/issues/5411 and https://github.com/apache/arrow-rs/issues/5407

jayzhan211 commented 5 days ago

Note that the comparison for nested type is not supported in arrow-rs https://github.com/apache/arrow-rs/pull/5942, so we should implement them in datafusion. First attempt #11091

Ideally we should make it configurable so we can support both Spark and Postgres like behaviour for nulls