Open brancz opened 23 hours ago
I actually tried to understand what is different about dicts that are not in structs, and it turns out that the row converter also emits plain arrays in those cases, but something turns them back into dictionaries at some point (I'm guessing this has to be in datafusion somewhere).
Example: https://gist.github.com/brancz/9ff04f6263b710ad8215933590026500
Ok, found it. This is the place where, if the data is of an unexpected type, it's cast to the expected type.
I think the right fix in that case is 1, adding support for nested schemas.
I agree this sounds like it makes sense. There even seems to be an existing ticket: #7647
Note that there is a PR by @jayzhan211 to rework how grouping is done to avoid the RowConverter in many cases in https://github.com/apache/datafusion/pull/12269. I haven't reviewed it thoroughly, but I would suggest that you ensure your fix for this issue is well covered by end-to-end .slt tests (not just unit tests).
Describe the bug
Aggregations on a struct of dictionaries produce data with a different schema than expected (plain arrays instead of dictionaries).
For example:
To Reproduce
Have a schema with a struct of dictionaries, and perform an aggregation on it, like count_distinct. Full code example here:
https://gist.github.com/brancz/fa12a3ae0f5d09620e9c274384ffd506
Expected behavior
No panic.
Additional context
I can see two ways to solve this: 1) Currently, the aggregation says it will emit data with the dictionaries being dictionaries. Instead, if all it did was declare it would emit plain arrays instead of dictionary-encoded ones, it would not panic. 2) Have RowConverter emit the same DataType as its input.
I think I'm slightly in favor of 1, because with 2 we'd either have to revert to stateful row converters, which were removed intentionally, or we'd have to copy data again on emitting to turn the currently plain arrays into dictionaries again.
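To make option 1 a bit more concrete, here is a minimal sketch of what "declare plain arrays instead of dictionary-encoded ones" could look like for nested types. It does not use arrow or datafusion: ToyType is a hypothetical stand-in for arrow's DataType, and strip_dictionaries is a made-up helper name, not an existing API. The only point is that the declared output type would have to be rewritten recursively, so that dictionaries inside structs are covered too.

```rust
// Hypothetical toy model of a data type tree. This is NOT arrow's DataType,
// just the few variants needed to illustrate the idea.
#[derive(Debug, Clone, PartialEq)]
enum ToyType {
    Int32,
    Utf8,
    // Dictionary(key type, value type)
    Dictionary(Box<ToyType>, Box<ToyType>),
    // Struct with named fields
    Struct(Vec<(String, ToyType)>),
}

/// Recursively replace every dictionary type with its value type,
/// descending into struct fields. (Assumed helper, for illustration only.)
fn strip_dictionaries(dt: &ToyType) -> ToyType {
    match dt {
        ToyType::Dictionary(_, value) => strip_dictionaries(value),
        ToyType::Struct(fields) => ToyType::Struct(
            fields
                .iter()
                .map(|(name, ty)| (name.clone(), strip_dictionaries(ty)))
                .collect(),
        ),
        other => other.clone(),
    }
}

fn main() {
    // A struct with one dictionary-encoded string field, as in the report.
    let declared = ToyType::Struct(vec![(
        "label".to_string(),
        ToyType::Dictionary(Box::new(ToyType::Int32), Box::new(ToyType::Utf8)),
    )]);

    // The type the aggregation would declare under option 1.
    let emitted = strip_dictionaries(&declared);

    println!("declared: {:?}", declared);
    println!("emitted:  {:?}", emitted);
}
```

In this toy model the aggregation would advertise the stripped type as its output, so the plain arrays coming out of the row converter would already match the declared schema and no re-encoding copy would be needed on emit.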