apache / datafusion-comet

Apache DataFusion Comet Spark Accelerator
https://datafusion.apache.org/comet
Apache License 2.0
823 stars 163 forks source link

chore: Bump arrow-rs to 53.1.0 and datafusion #1001

Closed kazuyukitanimura closed 1 month ago

kazuyukitanimura commented 1 month ago

Which issue does this PR close?

Rationale for this change

Arrow-rs 53.1.0 includes performance improvements

What changes are included in this PR?

Bumping arrow-rs to 53.1.0 and datafusion to a revision

How are these changes tested?

existing tests

kazuyukitanimura commented 1 month ago

@alamb @jayzhan211 I am still investigating but looks like there is a regression in DataFusion related to https://github.com/apache/datafusion/pull/12753

assertion `left == right` failed: Arrays with inconsistent types passed to MutableArrayData
  left: Struct([Field { name: "a", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }])
 right: Struct([Field { name: "a", data_type: Dictionary(Int32, Utf8), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }])
        at comet::errors::init::{{closure}}(/__w/datafusion-comet/datafusion-comet/native/core/src/errors.rs:151)
        at <alloc::boxed::Box<F,A> as core::ops::function::Fn<Args>>::call(/rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/alloc/src/boxed.rs:2036)
        at std::panicking::rust_panic_with_hook(/rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/panicking.rs:799)
        at std::panicking::begin_panic_handler::{{closure}}(/rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/panicking.rs:664)
        at std::sys_common::backtrace::__rust_end_short_backtrace(/rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/sys_common/backtrace.rs:171)
        at rust_begin_unwind(/rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/panicking.rs:652)
        at core::panicking::panic_fmt(/rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/panicking.rs:72)
        at core::panicking::assert_failed_inner(/rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/panicking.rs:404)
        at core::panicking::assert_failed(/rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/panicking.rs:364)
        at arrow_data::transform::MutableArrayData::with_capacities(/usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/arrow-data-53.1.0/src/transform/mod.rs:418)
        at datafusion_functions_nested::make_array::array_array(/usr/local/cargo/git/checkouts/datafusion-c36d6291a88e48f3/577e4bb/datafusion/functions-nested/src/make_array.rs:223)
        at datafusion_functions_nested::make_array::make_array_inner(/usr/local/cargo/git/checkouts/datafusion-c36d6291a88e48f3/577e4bb/datafusion/functions-nested/src/make_array.rs:153)
        at core::ops::function::Fn::call(/rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/ops/function.rs:79)
        at datafusion_functions_nested::utils::make_scalar_function::{{closure}}(/usr/local/cargo/git/checkouts/datafusion-c36d6291a88e48f3/577e4bb/datafusion/functions-nested/src/utils.rs:83)
        at <datafusion_functions_nested::make_array::MakeArray as datafusion_expr::udf::ScalarUDFImpl>::invoke(/usr/local/cargo/git/checkouts/datafusion-c36d6291a88e48f3/577e4bb/datafusion/functions-nested/src/make_array.rs:96)
        at datafusion_expr::udf::ScalarUDF::invoke(/usr/local/cargo/git/checkouts/datafusion-c36d6291a88e48f3/577e4bb/datafusion/expr/src/udf.rs:197)
        at <datafusion_physical_expr::scalar_function::ScalarFunctionExpr as datafusion_physical_expr_common::physical_expr::PhysicalExpr>::evaluate(/usr/local/cargo/git/checkouts/datafusion-c36d6291a88e48f3/577e4bb/datafusion/physical-expr/src/scalar_function.rs:145)
        at datafusion_physical_plan::projection::ProjectionStream::batch_project::{{closure}}(/usr/local/cargo/git/checkouts/datafusion-c36d6291a88e48f3/577e4bb/datafusion/physical-plan/src/projection.rs:294)

Especially https://github.com/apache/datafusion/pull/12753/files#diff-f1e354d4fe26237064d8194e10a6008efa4f88e2b68b8a8352086a5d011180b8R108 type_union_resolution returns None for array that contains both Utf8 and Dictionary(Int32, Utf8)

Previously, coerce_types was coercing such cases. Not sure if this is an intentional change...

kazuyukitanimura commented 1 month ago

Just realized that type_union_resolution_coercion not handling struct...

jayzhan211 commented 1 month ago

We need to support Struct, but I'm not sure whether we should discard dictionary.

I assume if you have dictionary type at the first place, you expect to do some optimization based on dictionary type, so if we have primitive and dictionary type, I think it makes sense to keep dictionary type and coerce the primitive type to dictionary type.

Previous code utilize comparison coercion, in this case we just care about comparison therefore we don't need dictionary type at all.

kazuyukitanimura commented 1 month ago

Confirmed it was about adding structure support. Opened https://github.com/apache/datafusion/issues/12843 Ideal to fix it before the next release, otherwise it is a regression. @jayzhan211 @alamb For this specific test case to pass, I just needed to add

.or_else(|| struct_coercion(lhs_type, rhs_type))

at the end of type_union_resolution_coercion. Dictionary is fine, it seems to be working.

kazuyukitanimura commented 1 month ago

@andygrove @comphead @huaxingao @mbutrovich @parthchandra @viirya

kazuyukitanimura commented 1 month ago

Thanks @comphead I think the next DF release will happen soon, so we can bump in the next PR or if they release it first, I can update this PR.

andygrove commented 1 month ago

Thanks @comphead I think the next DF release will happen soon, so we can bump in the next PR or if they release it first, I can update this PR.

I think we can start the DF 43 release prep this week

kazuyukitanimura commented 1 month ago

Thanks, merged @jayzhan211 @comphead @andygrove @parthchandra @viirya