Is your feature request related to a problem or challenge?
As we work to make extracting statistics from parquet data pages more correct and performant in https://github.com/apache/datafusion/issues/10922, it would be good to have benchmark coverage.
Describe the solution you'd like
Add a benchmark for extracting page statistics
Describe alternatives you've considered
Add a benchmark (source) for extracting data page statistics
These are run via
cargo bench --bench parquet_statistic
In order to create a reasonable number of data page statistics, it would be good to configure the parquet writer to limit the size of data pages
https://github.com/apache/datafusion/blob/ece7ae5eca451bb2599f13f9f9197fd93b2a8bc2/datafusion/core/benches/parquet_statistic.rs#L75
And use https://docs.rs/parquet/latest/parquet/file/properties/struct.WriterProperties.html#method.data_page_row_count_limit to set the limit to 1 and then send the data in row by row, as we did in the test:
https://github.com/apache/datafusion/blob/d175163ef6442056d8210de9b0e28e264c39ca2c/datafusion/core/tests/parquet/arrow_statistics.rs#L105-L130
Additional context
The need for a benchmark also came up in https://github.com/apache/datafusion/pull/10932