apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0

Add a benchmark for extracting parquet data page statistics #10934

Open alamb opened 1 week ago

alamb commented 1 week ago

Is your feature request related to a problem or challenge?

As we work to make extracting statistics from parquet data pages more correct and performant in https://github.com/apache/datafusion/issues/10922, one thing that would be good to have is benchmark coverage.

Describe the solution you'd like

Add a benchmark for extracting page statistics

Describe alternatives you've considered

Add a benchmark (source) for extracting data page statistics

These are run via

cargo bench --bench parquet_statistic

In order to create a reasonable number of data page statistics, it would be good to configure the parquet writer to limit the size of data pages

https://github.com/apache/datafusion/blob/ece7ae5eca451bb2599f13f9f9197fd93b2a8bc2/datafusion/core/benches/parquet_statistic.rs#L75

And use https://docs.rs/parquet/latest/parquet/file/properties/struct.WriterProperties.html#method.data_page_row_count_limit to set the limit to 1 and then send the data in row by row, as we did in the test:

https://github.com/apache/datafusion/blob/d175163ef6442056d8210de9b0e28e264c39ca2c/datafusion/core/tests/parquet/arrow_statistics.rs#L105-L130
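
For illustration, the writer setup described above might look roughly like the following. This is a minimal sketch, not the code from the linked benchmark or test: the function name `write_one_row_per_page`, the column name `i64`, and the in-memory `Vec<u8>` target are assumptions made for the example, and it assumes the `arrow` and `parquet` crates are available as dependencies. The limit is set through the builder method `set_data_page_row_count_limit`, and the batch is sliced into single rows so the writer has a chance to start a new page after each write.

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, Int64Array};
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;
use parquet::file::properties::{EnabledStatistics, WriterProperties};

/// Hypothetical helper: writes `num_rows` Int64 values into an in-memory
/// parquet file, aiming for one row per data page.
fn write_one_row_per_page(num_rows: i64) -> Result<Vec<u8>, Box<dyn std::error::Error>> {
    // Single-column batch; the column name "i64" is an assumption.
    let values: ArrayRef = Arc::new(Int64Array::from_iter_values(0..num_rows));
    let batch = RecordBatch::try_from_iter(vec![("i64", values)])?;

    // Limit each data page to one row and request page-level statistics.
    let props = WriterProperties::builder()
        .set_data_page_row_count_limit(1)
        .set_statistics_enabled(EnabledStatistics::Page)
        .build();

    let mut buffer = Vec::new();
    let mut writer = ArrowWriter::try_new(&mut buffer, batch.schema(), Some(props))?;

    // Send the data in row by row so the row count limit can take effect
    // between writes.
    for row in 0..batch.num_rows() {
        writer.write(&batch.slice(row, 1))?;
    }
    writer.close()?;

    Ok(buffer)
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let bytes = write_one_row_per_page(100)?;
    println!("wrote {} bytes of parquet data", bytes.len());
    Ok(())
}
```

A benchmark could build such a file once during setup and then time only the statistics extraction, so that the cost of writing one row at a time does not show up in the measurement.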

Additional context

The need for a benchmark also came up in https://github.com/apache/datafusion/pull/10932

marvinlanhenke commented 1 week ago

take