Closed kazuyukitanimura closed 1 month ago
@alamb @jayzhan211 I am still investigating but looks like there is a regression in DataFusion related to https://github.com/apache/datafusion/pull/12753
assertion `left == right` failed: Arrays with inconsistent types passed to MutableArrayData
left: Struct([Field { name: "a", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }])
right: Struct([Field { name: "a", data_type: Dictionary(Int32, Utf8), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }])
at comet::errors::init::{{closure}}(/__w/datafusion-comet/datafusion-comet/native/core/src/errors.rs:151)
at <alloc::boxed::Box<F,A> as core::ops::function::Fn<Args>>::call(/rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/alloc/src/boxed.rs:2036)
at std::panicking::rust_panic_with_hook(/rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/panicking.rs:799)
at std::panicking::begin_panic_handler::{{closure}}(/rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/panicking.rs:664)
at std::sys_common::backtrace::__rust_end_short_backtrace(/rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/sys_common/backtrace.rs:171)
at rust_begin_unwind(/rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/panicking.rs:652)
at core::panicking::panic_fmt(/rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/panicking.rs:72)
at core::panicking::assert_failed_inner(/rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/panicking.rs:404)
at core::panicking::assert_failed(/rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/panicking.rs:364)
at arrow_data::transform::MutableArrayData::with_capacities(/usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/arrow-data-53.1.0/src/transform/mod.rs:418)
at datafusion_functions_nested::make_array::array_array(/usr/local/cargo/git/checkouts/datafusion-c36d6291a88e48f3/577e4bb/datafusion/functions-nested/src/make_array.rs:223)
at datafusion_functions_nested::make_array::make_array_inner(/usr/local/cargo/git/checkouts/datafusion-c36d6291a88e48f3/577e4bb/datafusion/functions-nested/src/make_array.rs:153)
at core::ops::function::Fn::call(/rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/ops/function.rs:79)
at datafusion_functions_nested::utils::make_scalar_function::{{closure}}(/usr/local/cargo/git/checkouts/datafusion-c36d6291a88e48f3/577e4bb/datafusion/functions-nested/src/utils.rs:83)
at <datafusion_functions_nested::make_array::MakeArray as datafusion_expr::udf::ScalarUDFImpl>::invoke(/usr/local/cargo/git/checkouts/datafusion-c36d6291a88e48f3/577e4bb/datafusion/functions-nested/src/make_array.rs:96)
at datafusion_expr::udf::ScalarUDF::invoke(/usr/local/cargo/git/checkouts/datafusion-c36d6291a88e48f3/577e4bb/datafusion/expr/src/udf.rs:197)
at <datafusion_physical_expr::scalar_function::ScalarFunctionExpr as datafusion_physical_expr_common::physical_expr::PhysicalExpr>::evaluate(/usr/local/cargo/git/checkouts/datafusion-c36d6291a88e48f3/577e4bb/datafusion/physical-expr/src/scalar_function.rs:145)
at datafusion_physical_plan::projection::ProjectionStream::batch_project::{{closure}}(/usr/local/cargo/git/checkouts/datafusion-c36d6291a88e48f3/577e4bb/datafusion/physical-plan/src/projection.rs:294)
Especially https://github.com/apache/datafusion/pull/12753/files#diff-f1e354d4fe26237064d8194e10a6008efa4f88e2b68b8a8352086a5d011180b8R108 type_union_resolution
returns None
for array that contains both Utf8
and Dictionary(Int32, Utf8)
Previously, coerce_types
was coercing such cases. Not sure if this is an intentional change...
Just realized that type_union_resolution_coercion
not handling struct...
We need to support Struct, but I'm not sure whether we should discard dictionary.
I assume if you have dictionary type at the first place, you expect to do some optimization based on dictionary type, so if we have primitive and dictionary type, I think it makes sense to keep dictionary type and coerce the primitive type to dictionary type.
Previous code utilize comparison coercion
, in this case we just care about comparison therefore we don't need dictionary type at all.
Confirmed it was about adding structure support. Opened https://github.com/apache/datafusion/issues/12843 Ideal to fix it before the next release, otherwise it is a regression. @jayzhan211 @alamb For this specific test case to pass, I just needed to add
.or_else(|| struct_coercion(lhs_type, rhs_type))
at the end of type_union_resolution_coercion
. Dictionary is fine, it seems to be working.
@andygrove @comphead @huaxingao @mbutrovich @parthchandra @viirya
Thanks @comphead I think the next DF release will happen soon, so we can bump in the next PR or if they release it first, I can update this PR.
Thanks @comphead I think the next DF release will happen soon, so we can bump in the next PR or if they release it first, I can update this PR.
I think we can start the DF 43 release prep this week
Thanks, merged @jayzhan211 @comphead @andygrove @parthchandra @viirya
Which issue does this PR close?
Rationale for this change
Arrow-rs 53.1.0 includes performance improvements
What changes are included in this PR?
Bumping arrow-rs to 53.1.0 and datafusion to a revision
How are these changes tested?
existing tests