Closed alamb closed 3 months ago
Boolean type is supported, it was added in commit 9845e6e of PR #10711
Sorry for the noise
Actually I don't think this is actually done
This ticket covers extracting DataPage statistics (not row group statistics, which are annoyingly different in parquet 🤯 )
The data page statistics are extracted here
In order to complete this issue, we need to change https://github.com/apache/datafusion/blob/58d23c5c050f43aa7b867d4f0be7298d8d6cad83/datafusion/core/tests/parquet/arrow_statistics.rs#L1956
to
check: Check::Both,
And make the tests pass
Oh sorry, that was stupid of me.
Oh sorry, that was stupid of me.
No worries at all -- this stuff is tricky
Yeah, all those similar names do get to me sometimes...
On another note, I tried to implement this like all the others did, but the test fails with :
thread 'parquet::arrow_statistics::test_boolean' panicked at src/array/boolean_array.rs:407:33:
Iterator must be sized
The implementation is like this:
make_data_page_stats_iterator!(
MinBooleanDataPageStatsIterator,
|x: &PageIndex<bool>| { x.min },
Index::BOOLEAN,
bool
);
make_data_page_stats_iterator!(
MaxBooleanDataPageStatsIterator,
|x: &PageIndex<bool>| { x.max },
Index::BOOLEAN,
bool
);
...
macro_rules! get_data_page_statistics {
($stat_type_prefix: ident, $data_type: ident, $iterator: ident) => {
paste! {
match $data_type {
Some(DataType::Boolean) => Ok(Arc::new(
BooleanArray::from_iter(
[<$stat_type_prefix BooleanDataPageStatsIterator>]::new($iterator).flatten()
)
)),
...
}
These macros, functions, and tests jump around a lot before I get to the caller, which causes this panic. Do you or anyone else know why this happens?
The iterator must be sized thing comes from arrow -- one workaround is to collect the values into a Vec first and then create the array
I don't know why boolean is different than the other data page types 🤔
take
Is your feature request related to a problem or challenge?
Part of https://github.com/apache/datafusion/issues/10922
We are adding APIs to efficiently convert the data stored in Parquet's "PageIndex" into
ArrayRef
s -- which will make it significantly easier to use this information for pruning and other tasks.Describe the solution you'd like
Add support to
StatisticsConverter::min_page_statistics
andStatisticsConverter::max_page_statistics
for the types abovehttps://github.com/apache/datafusion/blob/a923c659cf932f6369f2d5257e5b99128b67091a/datafusion/core/src/datasource/physical_plan/parquet/statistics.rs#L637-L656
Describe alternatives you've considered
You can follow the model from @Weijun-H in https://github.com/apache/datafusion/pull/10931
test_int64
https://github.com/apache/datafusion/blob/a923c659cf932f6369f2d5257e5b99128b67091a/datafusion/core/tests/parquet/arrow_statistics.rs#L506-L529Additional context
No response