apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0

Panic on aggregations on struct of dictionaries #12542

Open brancz opened 23 hours ago

brancz commented 23 hours ago

Describe the bug

Aggregations on a struct of dictionaries produce data with a different schema than expected (plain arrays instead of dictionaries).

For example:

thread 'tokio-runtime-worker' panicked at /Users/brancz/.cargo/registry/src/index.crates.io-6f17d22bba15001f/arrow-array-53.0.0/src/array/struct_array.rs:90:46:
called `Result::unwrap()` on an `Err` value: InvalidArgumentError("Incorrect datatype for StructArray field \"a\", expected Dictionary(Int32, Utf8) got Utf8")
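
To make the failure mode concrete, here is a minimal std-only sketch of the kind of check that rejects the data. The `DataType` enum and `check_struct_fields` are hypothetical stand-ins written for illustration, not the arrow-rs types or API; the only thing taken from the report is the shape of the error message.

```rust
// Toy model of Arrow data types; NOT the real arrow `DataType`.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int32,
    Utf8,
    // (key type, value type)
    Dictionary(Box<DataType>, Box<DataType>),
}

// Hypothetical stand-in for the validation a struct array runs on construction:
// every child must have exactly the type its field declares.
fn check_struct_fields(
    declared: &[(&str, DataType)],
    actual: &[DataType],
) -> Result<(), String> {
    if declared.len() != actual.len() {
        return Err(format!(
            "expected {} children, got {}",
            declared.len(),
            actual.len()
        ));
    }
    for ((name, expected), got) in declared.iter().zip(actual) {
        if expected != got {
            // Same wording as the error in the panic above.
            return Err(format!(
                "Incorrect datatype for StructArray field {:?}, expected {:?} got {:?}",
                name, expected, got
            ));
        }
    }
    Ok(())
}
```

Declaring field `a` as `Dictionary(Int32, Utf8)` and passing a plain `Utf8` child makes this return the `Err`; calling `unwrap()` on that result is what turns the mismatch into a panic.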

To Reproduce

Have a schema with a struct of dictionaries, and perform an aggregation on it, like count_distinct.

Full code example here:

https://gist.github.com/brancz/fa12a3ae0f5d09620e9c274384ffd506

Expected behavior

No panic.

Additional context

I can see two ways to solve this:

1) Currently, the aggregation declares that it will emit the dictionary fields as dictionaries. If it instead declared that it emits plain arrays rather than dictionary-encoded ones, it would not panic.
2) Have RowConverter emit the same DataType as its input.

I think I'm slightly in favor of 1, because with 2 we'd either have to revert to stateful row converters, which were removed intentionally, or copy the data again on emit to turn the plain arrays back into dictionaries.

brancz commented 18 hours ago

I actually tried to understand what is different about dicts that are not in structs, and it turns out that the row converter also emits plain arrays in those cases, but something turns them back into dictionaries at some point (I'm guessing this has to be in datafusion somewhere).

Example: https://gist.github.com/brancz/9ff04f6263b710ad8215933590026500

brancz commented 17 hours ago

Ok, found it. This is the place where, if the data is of an unexpected type, it is cast to the expected type.

I think the right fix in that case is 1, adding support for nested schemas.
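
A rough std-only sketch of what "support for nested schemas" could mean for option 1: compute the declared output type by stripping dictionary encoding recursively, so dictionaries inside structs are handled as well as top-level ones. The `DataType` enum and `without_dictionaries` are hypothetical names for illustration and are not DataFusion or arrow-rs APIs.

```rust
// Toy model of Arrow data types; NOT the real arrow `DataType`.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int32,
    Utf8,
    // (key type, value type)
    Dictionary(Box<DataType>, Box<DataType>),
    // (field name, field type) pairs
    Struct(Vec<(String, DataType)>),
}

// Replace every dictionary type with its value type, descending into structs.
fn without_dictionaries(dt: &DataType) -> DataType {
    match dt {
        // A dictionary is emitted as its (possibly nested) value type.
        DataType::Dictionary(_key, value) => without_dictionaries(value),
        // Recurse into struct fields, keeping the field names.
        DataType::Struct(fields) => DataType::Struct(
            fields
                .iter()
                .map(|(name, child)| (name.clone(), without_dictionaries(child)))
                .collect(),
        ),
        // Plain types are unchanged.
        other => other.clone(),
    }
}
```

With this, a declared type of `Struct { a: Dictionary(Int32, Utf8) }` becomes `Struct { a: Utf8 }`, which matches the plain arrays described above as the row converter's output.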

alamb commented 17 hours ago

> I think the right fix in that case is 1, adding support for nested schemas.

I agree this sounds like it makes sense. There even seems to be an existing ticket: #7647

Note that there is a PR by @jayzhan211 to rework how grouping is done to avoid the RowConverter in many cases in https://github.com/apache/datafusion/pull/12269. I haven't reviewed it thoroughly, but I would suggest that you ensure your fix for this issue is well covered by end-to-end .slt tests (not just unit tests).