Closed gruuya closed 6 hours ago
To reiterate, the real check for a fix here should be passing test like this
#[test]
fn test_simplify_cast_list() {
use datafusion_functions_nested::expr_fn::make_array;
let element_field = Arc::new(Field::new("element", DataType::Int32, true));
let items_field =
Field::new("items", DataType::List(element_field.clone()), true);
let schema = Schema::new(vec![items_field.clone()])
.to_dfschema()
.unwrap();
let input = cast(
make_array(vec![lit(2), lit(3)]),
DataType::List(element_field.clone()),
);
let output = simplify(input.clone());
assert_eq!(
input.data_type_and_nullable(&schema).unwrap(),
output.data_type_and_nullable(&schema).unwrap(),
);
}
In other words we'd want the type (and null-ability?) to remain the same after simplification.
Ok I think that whereas https://github.com/apache/datafusion/pull/13468 restores the data type equivalence after simplification (and thus unblocks the delta-rs issue) it does not help with nullabilty matching as well. Arguably nullability equivalence should not be a blocker, and could be tracked separately I think (if at all).
Indeed digging a bit more suggests that non-matching nullability is sort of ok, as it can be explained away like this
true
for the input nullability, as it falls back to the default value in ScalarUDFImpl::is_nullable
(which MakeArray
doesn't override) https://github.com/apache/datafusion/blob/5ee524ec70b1f10a078caca62954ce37b2dc3cc6/datafusion/expr/src/udf.rs#L476-L478false
for the output nullability, as it reduces to checking whether the single list scalar element is null or not https://github.com/apache/datafusion/blob/5ee524ec70b1f10a078caca62954ce37b2dc3cc6/datafusion/common/src/scalar/mod.rs#L1559Incidentally the first one matches the actual nullability in the schema, whereas the second doesn't/
@gruuya I am sort of lost --are you saying that https://github.com/apache/datafusion/pull/13468 does not fix this issue? That would make sense to me
Sorry for the confusion, the tldr is that :
So I think we should merge it once everything's been addressed.
Describe the bug
It seems
SimplifyExpressions
will sometimes omit a cast which was introduced by theTypeCoercion
analyzer rule, leading to a loss in the precision of the output DataType.To Reproduce
Run
The cast clause gets stripped since the constant expression gets evaluated, so the first assert (expectedly) fails with
However the second assert will also (unexpectedly) fail with
Meaning that the custom list item field name is lost now.
Effects
I believe this in turn causes https://github.com/delta-io/delta-rs/pull/2886.
Basically
TypeCoercion
injects the CAST, whileSimplifyExpressions
omits it without replacing the precise data type. ConsequentlyOptimizeProjections
folds the two projections, but the schema is computed from the outermost projection expressions, which now lack the proper type info (i.e. the non-standard list item field names).Minimal repro:
Fails with
Error: Context("Optimizer rule 'optimize_projections' failed", Context("optimize_projections", Internal("Failed due to a difference in schemas, original schema: DFSchema { inner: Schema { fields: [Field { name: \"items\", data_type: List(Field { name: \"element\", data_type: Int32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} }, field_qualifiers: [None], functional_dependencies: FunctionalDependencies { deps: [] } }, new schema: DFSchema { inner: Schema { fields: [Field { name: \"items\", data_type: List(Field { name: \"item\", data_type: Int32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} }, field_qualifiers: [None], functional_dependencies: FunctionalDependencies { deps: [] } }")))
Expected behavior
SimplifyExpressions
should retain exact output type, and there should be no optimizer errors in the second (high-level) example.Additional context
No response