Closed: appletreeisyellow closed this 3 months ago
Yes, the code summarizing the max and min isn't working correctly for a Dictionary. In the test case, the max_value or min_value is a StringArray that needs to be mapped to the appropriate Dictionary type before being passed into the update_batch methods. I will have a go at fixing it, but I can't do that until tomorrow morning, so someone else should feel free to pick it up if they need a fix before then.
In fact, it seems to me that the mapping to the correct dictionary type should probably be performed here? https://github.com/apache/datafusion/blob/7e49ccf3dd3408bc9c4adb86f070d1e3d1f4c1e2/datafusion/core/src/datasource/physical_plan/parquet/statistics.rs#L452-L454
Well, I don't think it can be easily modified at the source, and that maybe isn't the right thing to do anyway. So it is probably best to just address it in summarize_min_max_null_counts.
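To illustrate the idea discussed above, here is a minimal, std-only sketch of mapping a plain string value to the column's Dictionary type before it reaches a type-checked accumulator. Everything in it is hypothetical: `DataType`, `Scalar`, `MaxAccumulator`, and `map_to_column_type` are simplified stand-ins, not the Arrow or DataFusion APIs, and the choice to silently skip mismatched values is an assumption made only to mimic the NULL result in the report. The real failure path may differ.

```rust
// Simplified, hypothetical stand-ins for Arrow's DataType and DataFusion's
// ScalarValue; the dictionary key type (Int32) is omitted for brevity.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Utf8,
    Dictionary(Box<DataType>),
}

#[derive(Debug, Clone, PartialEq)]
enum Scalar {
    Utf8(Option<String>),
    Dictionary(Box<Scalar>),
}

impl Scalar {
    fn data_type(&self) -> DataType {
        match self {
            Scalar::Utf8(_) => DataType::Utf8,
            Scalar::Dictionary(inner) => DataType::Dictionary(Box::new(inner.data_type())),
        }
    }

    // Borrow the underlying string, looking through a dictionary wrapper.
    fn as_str(&self) -> Option<&str> {
        match self {
            Scalar::Utf8(v) => v.as_deref(),
            Scalar::Dictionary(inner) => inner.as_str(),
        }
    }
}

// Wrap a plain Utf8 value so that it matches a Dictionary(Utf8) column type.
// Values that already match, or that have no known mapping, pass through.
fn map_to_column_type(value: Scalar, target: &DataType) -> Scalar {
    match (value, target) {
        (Scalar::Utf8(v), DataType::Dictionary(inner)) if **inner == DataType::Utf8 => {
            Scalar::Dictionary(Box::new(Scalar::Utf8(v)))
        }
        (other, _) => other,
    }
}

// Toy max accumulator tied to one column type. Mismatched inputs are skipped,
// which is an assumption chosen to mimic the NULL result in the report.
struct MaxAccumulator {
    data_type: DataType,
    max: Option<String>,
}

impl MaxAccumulator {
    fn new(data_type: DataType) -> Self {
        Self { data_type, max: None }
    }

    fn update_batch(&mut self, values: &[Scalar]) {
        for value in values {
            if value.data_type() != self.data_type {
                continue;
            }
            if let Some(s) = value.as_str() {
                if self.max.as_deref().map_or(true, |m| s > m) {
                    self.max = Some(s.to_string());
                }
            }
        }
    }

    // Report the running max in the column's own type (NULL if nothing was seen).
    fn evaluate(&self) -> Scalar {
        map_to_column_type(Scalar::Utf8(self.max.clone()), &self.data_type)
    }
}
```

In this toy model, feeding unmapped Utf8 values into an accumulator built for a Dictionary column leaves the result as a NULL dictionary value, while passing each value through `map_to_column_type` first lets the accumulator see them.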
take
Describe the bug
When a column has the Dictionary data type, the parquet metadata statistics return Exact(Dictionary(Int32, Utf8(NULL))) for the min and max values.
To Reproduce
Run the test below in this file: https://github.com/apache/datafusion/blob/8216e32e87b2238d8814fe16215c8770d6c327c8/datafusion/core/src/datasource/file_format/parquet.rs#L1363
Expected behavior
Expect statistics to show the min and max values. For the reproducer given above, I'm expecting to get:
max_value: Exact(Dictionary(Int32, Utf8("a")))
min_value: Exact(Dictionary(Int32, Utf8("d")))
Additional context
The underlying statistics extraction code should have no problems extracting statistics from Dictionary columns.
The code is here:
https://github.com/apache/datafusion/blob/7e49ccf3dd3408bc9c4adb86f070d1e3d1f4c1e2/datafusion/core/src/datasource/physical_plan/parquet/statistics.rs#L452-L454
And the tests are here:
https://github.com/apache/datafusion/blob/7e49ccf3dd3408bc9c4adb86f070d1e3d1f4c1e2/datafusion/core/tests/parquet/arrow_statistics.rs#L1729-L1768
I wonder if something about the code that summarizes the statistics across row groups (https://github.com/apache/datafusion/blob/7e49ccf3dd3408bc9c4adb86f070d1e3d1f4c1e2/datafusion/core/src/datasource/file_format/parquet.rs#L468-L495) doesn't handle dictionaries correctly 🤔