Open alamb opened 11 months ago
Note that the pruning predicate code does correctly read the statistics for strings and timestamps, because it uses a different code path
I plan to fix this
Could I pick this ticket up?
In `fn summarize_min_max`, it cannot handle `ByteArray(ValueStatistics<ByteArray>)` well. Do we need to convert it to a different type like timestamps, strings, etc 🤔 ?
I think there is some subtlety related to decimals as well -- the best thing to do is probably to study what the existing code in row_groups does -- I think it is here https://github.com/apache/arrow-datafusion/blob/main/datafusion/core/src/datasource/physical_plan/parquet/statistics.rs#L57
At some point there were multiple code paths to extract statistics in parquet (one for file level and one for row group level) that should likely be combined
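To illustrate the idea discussed above, here is a minimal sketch of converting raw `ByteArray` min/max bytes into typed values (here, UTF-8 strings), the way the row-group pruning path does, rather than dropping them. The `ColStats` enum and `min_max_as_utf8` helper are hypothetical stand-ins for illustration, not the actual parquet or DataFusion types.

```rust
/// Simplified, illustrative stand-in for parquet's per-column statistics.
enum ColStats {
    Int64 { min: i64, max: i64 },
    ByteArray { min: Vec<u8>, max: Vec<u8> },
}

/// Interpret ByteArray min/max as UTF-8 strings, as one would for a
/// Utf8/String logical type. Returns None for non-byte-array columns
/// or bytes that are not valid UTF-8.
fn min_max_as_utf8(stats: &ColStats) -> Option<(String, String)> {
    match stats {
        ColStats::ByteArray { min, max } => {
            let min = String::from_utf8(min.clone()).ok()?;
            let max = String::from_utf8(max.clone()).ok()?;
            Some((min, max))
        }
        // Integer statistics already work; they are handled elsewhere.
        _ => None,
    }
}

fn main() {
    let stats = ColStats::ByteArray {
        min: b"apple".to_vec(),
        max: b"zebra".to_vec(),
    };
    if let Some((min, max)) = min_max_as_utf8(&stats) {
        println!("Min={} Max={}", min, max);
    }
}
```

The point is only that the byte-array variant needs an explicit conversion step keyed on the column's logical type, which is what the row_groups code path linked above already does.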
Describe the bug
While working on https://github.com/apache/arrow-datafusion/issues/8229 I found another bug that is non obvious, but that can be clearly seen now thanks to https://github.com/apache/arrow-datafusion/issues/8110 and https://github.com/apache/arrow-datafusion/issues/8111 from @NGA-TRAN
To Reproduce
And then look at the explain verbose output; you can see there are no min/max statistics shown:
Expected behavior
I expect there to be min/max values extracted in the statistics for the strings, as there are for integers (Col[0]: Min=Exact(Int64(1)) Max=Exact(Int64(3))).
Additional context
No response