apache / arrow-rs

Official Rust implementation of Apache Arrow
https://arrow.apache.org/
Apache License 2.0
2.63k stars 803 forks

Add Parquet RowSelection benchmark #6623

XiangpengHao closed 1 month ago

XiangpengHao commented 1 month ago

Which issue does this PR close?

Part of #5523

Rationale for this change

As the first step of measure-then-build, we add some benchmarks.

The benchmark has 300_000 rows, and the selector selects 1/3 of them. This roughly matches the SearchPhrase <> '' predicate in many ClickBench queries.
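
To make the setup concrete, here is a minimal, std-only sketch of a 300_000-row mask with about one row in three selected, collapsed into alternating select/skip runs. The `Run` struct, the `Lcg` generator, and both helper functions are hypothetical stand-ins for illustration; they are not the parquet crate's types and not the benchmark code from this PR.

```rust
/// A run of consecutive rows that are either all selected or all skipped.
/// Hypothetical stand-in for illustration; not the parquet crate's type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Run {
    row_count: usize,
    skip: bool,
}

/// Tiny deterministic pseudo-random generator (a 64-bit LCG), used only so
/// that the sketch needs no external crates.
struct Lcg(u64);

impl Lcg {
    fn next_u32(&mut self) -> u32 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        // Use the high bits, which are better distributed than the low ones.
        (self.0 >> 32) as u32
    }
}

/// Build a mask in which each row is selected with probability about 1/`one_in`.
fn random_mask(total_rows: usize, one_in: u32, seed: u64) -> Vec<bool> {
    let mut rng = Lcg(seed);
    (0..total_rows)
        .map(|_| rng.next_u32() % one_in == 0)
        .collect()
}

/// Collapse a boolean mask into alternating select/skip runs.
fn mask_to_runs(mask: &[bool]) -> Vec<Run> {
    let mut runs: Vec<Run> = Vec::new();
    for &selected in mask {
        let skip = !selected;
        // Extend the current run if it has the same kind, otherwise start a new one.
        if let Some(last) = runs.last_mut() {
            if last.skip == skip {
                last.row_count += 1;
                continue;
            }
        }
        runs.push(Run { row_count: 1, skip });
    }
    runs
}
```

Calling `mask_to_runs(&random_mask(300_000, 3, 42))` yields runs whose row counts sum to 300_000; the seed 42 is arbitrary.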

I added benchmarks for intersection, union, from_filters, and and_then because they are the most prominent ones in the flamegraph.
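
For readers unfamiliar with the operations being benchmarked, the sketch below models their meaning on plain boolean masks, assuming the semantics described in the parquet crate's `RowSelection` documentation. It is a reference model only: the function names are hypothetical, the real implementation works on run-length selectors rather than per-row booleans, and `from_filters` (which builds a selection from boolean filter arrays) is not modeled here.

```rust
/// Rows selected by both masks (models `intersection`).
fn intersect_masks(a: &[bool], b: &[bool]) -> Vec<bool> {
    a.iter().zip(b).map(|(&x, &y)| x && y).collect()
}

/// Rows selected by either mask (models `union`).
fn union_masks(a: &[bool], b: &[bool]) -> Vec<bool> {
    a.iter().zip(b).map(|(&x, &y)| x || y).collect()
}

/// Models `and_then`: `second` has one entry per row that `first` selected,
/// and the result is expressed over the original rows.
/// Assumption of this sketch: if `second` is too short, the missing entries
/// are treated as "not selected".
fn and_then_masks(first: &[bool], second: &[bool]) -> Vec<bool> {
    let mut inner = second.iter();
    first
        .iter()
        // `&&` short-circuits, so `inner` only advances for rows `first` selected.
        .map(|&selected| selected && inner.next().copied().unwrap_or(false))
        .collect()
}
```

The key difference is that `intersect_masks` and `union_masks` take two masks over the same rows, while `and_then_masks` takes a second mask defined only over the rows that survived the first.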

What changes are included in this PR?

Are there any user-facing changes?

tustvold commented 1 month ago

Thank you for this. I'm sure you're aware, and it may be what you're trying to empirically demonstrate, but RowSelection is not designed for highly non-contiguous, e.g. random selections. It might be worth adding some benchmarks of long contiguous selections, as might arise when filtering sorted data.
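
A rough way to see the point about non-contiguous selections: a run-length representation needs one entry per run, so its size tracks the number of select/skip transitions rather than the number of selected rows. The std-only sketch below counts runs for a contiguous selection and for a scattered one. All names and the seed are hypothetical, and nothing here is taken from the parquet crate.

```rust
/// Count maximal runs of equal values in a mask. A run-length selection
/// needs one entry per run.
fn count_runs(mask: &[bool]) -> usize {
    if mask.is_empty() {
        return 0;
    }
    // Every change of value between neighbouring rows starts a new run.
    1 + mask.windows(2).filter(|w| w[0] != w[1]).count()
}

/// Select one contiguous block of `len` rows starting at `start`.
fn contiguous_mask(total_rows: usize, start: usize, len: usize) -> Vec<bool> {
    (0..total_rows)
        .map(|i| i >= start && i < start + len)
        .collect()
}

/// Select each row with probability about 1/`one_in`, using a small 64-bit LCG
/// so that no external crate is needed.
fn scattered_mask(total_rows: usize, one_in: u32, seed: u64) -> Vec<bool> {
    let mut state = seed;
    (0..total_rows)
        .map(|_| {
            state = state
                .wrapping_mul(6364136223846793005)
                .wrapping_add(1442695040888963407);
            ((state >> 32) as u32) % one_in == 0
        })
        .collect()
}
```

With 300_000 rows, `contiguous_mask(300_000, 100_000, 100_000)` has 3 runs (skip, select, skip). For independently selected rows with probability p, the expected number of transitions is about 2·p·(1−p)·N, which is roughly 133,000 for N = 300_000 and p = 1/3, so a scattered selection of the same size needs tens of thousands of times more entries.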

alamb commented 1 month ago

🫶

but RowSelection is not designed for highly non-contiguous, e.g. random selections.

yes, I think this is what @XiangpengHao is considering improving

It might be worth adding some benchmarks of long contiguous selections, as might arise when filtering sorted data

I agree adding benchmarks for the case where RowSelection already does well would be valuable (to ensure we don't introduce regressions)