apache / orc

Apache ORC - the smallest, fastest columnar storage for Hadoop workloads
https://orc.apache.org/
Apache License 2.0

ORC-1667: Add `check` tool to check the index of the specified column #1862

Closed by cxzl25 3 months ago

cxzl25 commented 3 months ago

What changes were proposed in this pull request?

This PR adds a `check` tool that checks the index of a specified column.

The filtering effect of each index can be tested by specifying a different `--type`:

- `check --type stat`: use only column statistics.
- `check --type bloom-filter`: use only the bloom filter.
- `check --type predicate`: use column statistics and the bloom filter in combination.
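To illustrate how the three `--type` modes differ, here is a minimal conceptual sketch in Python. It is not ORC's implementation and does not call the `check` tool: `TinyBloom`, `build_index`, and `check` are hypothetical names, the bloom filter is a toy built on SHA-256, and a "row group" is simply a list of integers. It assumes an equality predicate on a single column.

```python
import hashlib


class TinyBloom:
    """Toy bloom filter (hypothetical, not ORC's): k hash positions in an m-bit array."""

    def __init__(self, m=64, k=3):
        self.m, self.k, self.bits = m, k, 0

    def _positions(self, value):
        # Derive k bit positions from salted SHA-256 digests of the value.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, value):
        for p in self._positions(value):
            self.bits |= 1 << p

    def might_contain(self, value):
        # False means "definitely absent"; True means "possibly present".
        return all((self.bits >> p) & 1 for p in self._positions(value))


def build_index(row_groups):
    """Build per-row-group min/max statistics and a bloom filter."""
    index = []
    for values in row_groups:
        bloom = TinyBloom()
        for v in values:
            bloom.add(v)
        index.append({"min": min(values), "max": max(values), "bloom": bloom})
    return index


def check(index, value, check_type):
    """Return one bool per row group: True = must be read, False = can be skipped."""
    results = []
    for entry in index:
        by_stat = entry["min"] <= value <= entry["max"]
        by_bloom = entry["bloom"].might_contain(value)
        if check_type == "stat":
            results.append(by_stat)
        elif check_type == "bloom-filter":
            results.append(by_bloom)
        elif check_type == "predicate":
            # Combined mode: a row group is read only if both indexes allow it.
            results.append(by_stat and by_bloom)
        else:
            raise ValueError(f"unknown check type: {check_type}")
    return results


index = build_index([[1, 5, 9], [20, 25, 30]])
print(check(index, 5, "stat"))          # [True, False]
print(check(index, 7, "stat"))          # [True, False]: 7 is in range but absent
print(check(index, 7, "bloom-filter"))  # depends on hash collisions
print(check(index, 7, "predicate"))
```

In this model, min/max statistics cannot rule out a value that falls inside a row group's range but is not stored there (such as `7` above), whereas a bloom filter usually can, at the cost of occasional false positives. The combined mode never selects more row groups than either index alone.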

Why are the changes needed?

ORC supports generating bloom filter indexes for multiple specified columns, but it lacks a convenient tool to verify the effect of the bloom filter.

Parquet has a similar command, added in PARQUET-2138: Add ShowBloomFilterCommand to parquet-cli.

How was this patch tested?

Added unit tests.

Was this patch authored or co-authored using generative AI tooling?

No