apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
6.06k stars 1.14k forks source link

Support `Boolean` Parquet Data Page Statistics #11027

Closed alamb closed 3 months ago

alamb commented 3 months ago

Is your feature request related to a problem or challenge?

Part of https://github.com/apache/datafusion/issues/10922

We are adding APIs to efficiently convert the data stored in Parquet's "PageIndex" into ArrayRefs -- which will make it significantly easier to use this information for pruning and other tasks.

Describe the solution you'd like

Add support to StatisticsConverter::min_page_statistics and StatisticsConverter::max_page_statistics for the types above

https://github.com/apache/datafusion/blob/a923c659cf932f6369f2d5257e5b99128b67091a/datafusion/core/src/datasource/physical_plan/parquet/statistics.rs#L637-L656

Describe alternatives you've considered

You can follow the model from @Weijun-H in https://github.com/apache/datafusion/pull/10931

  1. Update the test for the listed data types following the model of test_int64 https://github.com/apache/datafusion/blob/a923c659cf932f6369f2d5257e5b99128b67091a/datafusion/core/tests/parquet/arrow_statistics.rs#L506-L529
  2. Add any required implementation in https://github.com/apache/datafusion/blob/2f4347647172f6997448b2e24d322b50c856f3a0/datafusion/core/src/datasource/physical_plan/parquet/statistics.rs#L575-L586 (follow the model of the row counts, https://github.com/apache/datafusion/blob/2f4347647172f6997448b2e24d322b50c856f3a0/datafusion/core/src/datasource/physical_plan/parquet/statistics.rs#L90)

Additional context

No response

LorrensP-2158466 commented 3 months ago

Boolean type is supported, it was added in commit 9845e6e of PR #10711

alamb commented 3 months ago

Sorry for the noise

alamb commented 3 months ago

Actually I don't think this is actually done

This ticket covers extracting DataPage statistics (not row group statistics, which are annoyingly different in parquet 🤯 )

The data page statistics are extracted here

https://github.com/apache/datafusion/blob/18042fd69138e19613844580408a71a200ea6caa/datafusion/core/src/datasource/physical_plan/parquet/statistics.rs#L612-L627

In order to complete this issue, we need to change https://github.com/apache/datafusion/blob/58d23c5c050f43aa7b867d4f0be7298d8d6cad83/datafusion/core/tests/parquet/arrow_statistics.rs#L1956

to

 check: Check::Both, 

And make the tests pass

LorrensP-2158466 commented 3 months ago

Oh sorry, that was stupid of me.

alamb commented 3 months ago

Oh sorry, that was stupid of me.

No worries at all -- this stuff is tricky

LorrensP-2158466 commented 3 months ago

Yeah, all those similar names do get to me sometimes...

On another note, I tried to implement this like all the others did, but the test fails with :

thread 'parquet::arrow_statistics::test_boolean' panicked at src/array/boolean_array.rs:407:33:
Iterator must be sized

The implementation is like this:

make_data_page_stats_iterator!(
    MinBooleanDataPageStatsIterator,
    |x: &PageIndex<bool>| { x.min },
    Index::BOOLEAN,
    bool
);
make_data_page_stats_iterator!(
    MaxBooleanDataPageStatsIterator,
    |x: &PageIndex<bool>| { x.max },
    Index::BOOLEAN,
    bool
);
...
macro_rules! get_data_page_statistics {
    ($stat_type_prefix: ident, $data_type: ident, $iterator: ident) => {
        paste! {
            match $data_type {
                Some(DataType::Boolean) => Ok(Arc::new(
                    BooleanArray::from_iter(
                        [<$stat_type_prefix BooleanDataPageStatsIterator>]::new($iterator).flatten()
                    )
                )),
       ...
}

These macros, functions, and tests jump around a lot before I get to the caller, which causes this panic. Do you or anyone else know why this happens?

alamb commented 3 months ago

The iterator must be sized thing comes from arrow -- one workaround is to collect the values into a Vec first and then create the array

I don't know why boolean is different than the other data page types 🤔

LorrensP-2158466 commented 3 months ago

take