apache / pinot

Apache Pinot - A realtime distributed OLAP datastore
https://pinot.apache.org/
Apache License 2.0

Increase DEFAULT_VALUE_PRUNER_IN_PREDICATE_THRESHOLD or derive the value with some simple policy #7913

Open mqliang opened 2 years ago

mqliang commented 2 years ago

For IN predicates, ColumnValueSegmentPruner prunes segments based on the column's min/max values and its Bloom Filter.

However, we skip pruning when there are too many values in the IN predicate; currently we skip it when the IN predicate has more than 10 items: https://github.com/apache/pinot/blob/12edbdb1e521e0d9ee35a2d6ef2fed172e5beaa1/pinot-spi/src/main/java/org/apache/pinot/spi/utils/CommonConstants.java#L287

The value 10 here is quite empirical. IIUC, the rationale is that the default FPP (false positive probability) of the Bloom Filter is 0.05: https://github.com/apache/pinot/blob/7e9ca6a5a4afe0d4e283ac1307c45430e474cbf2/pinot-spi/src/main/java/org/apache/pinot/spi/config/table/BloomFilterConfig.java#L28

This means that even if a segment does not contain a given value, the Bloom Filter has a 5% probability of falsely answering that the segment contains it.

1 - 0.95 ^ 10 ≈ 0.40

So if the IN predicate has 10 values, even if a segment does NOT contain any of the 10 values, the Bloom Filter has a ~40% probability of falsely answering that the segment can NOT be pruned. Let's say we have a query set:

Say we have a table that does not contain any of the 10 values in the IN predicate:

I think that is why the Bloom Filter does not help much (in terms of decreasing query latency) when the IN predicate has more than 10 values.
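The arithmetic above can be sketched as a standalone snippet (illustrative only, not Pinot code; the 0.05 comes from the default FPP linked above):

```java
public class BloomPruneMath {
    // Probability that a Bloom Filter with false-positive rate fpp reports
    // "maybe present" for at least one of n values that are all absent,
    // i.e. the segment survives pruning even though it contains none of
    // the n IN-clause values.
    static double pruneFailureProbability(double fpp, int n) {
        return 1.0 - Math.pow(1.0 - fpp, n);
    }

    public static void main(String[] args) {
        double fpp = 0.05; // Pinot's default Bloom Filter FPP
        for (int n : new int[]{1, 5, 10, 20, 90}) {
            System.out.printf("n=%2d -> %.2f%n", n, pruneFailureProbability(fpp, n));
        }
        // n=10 yields ~0.40: a 40% chance the segment is NOT pruned even
        // though none of the 10 values are present.
    }
}
```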

But if we look at this problem from a system-resource-usage point of view: a Pinot segment is usually 100MB to 500MB, and loading a segment from disk into memory and doing a full segment scan is much, much more expensive than consulting the Bloom Filter (thousands, even tens of thousands of times more expensive, IMO).

For example: if we assume that loading a segment from disk into memory and doing a full segment scan is 100 times more expensive than consulting the Bloom Filter, we should set DEFAULT_VALUE_PRUNER_IN_PREDICATE_THRESHOLD to 90 (1 - 0.95 ^ 90 ≈ 0.99). Let me elaborate:

In conclusion, increasing DEFAULT_VALUE_PRUNER_IN_PREDICATE_THRESHOLD beyond 10 does not help much in terms of reducing query latency, but it would be very helpful for decreasing overall system resource usage.
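The break-even point above can be derived with a back-of-the-envelope cost model (a hypothetical sketch under the stated 100x assumption, not Pinot code):

```java
public class PrunerThresholdModel {
    // Hypothetical model: treat one Bloom Filter consultation as 1 cost unit
    // and a full segment load + scan as costRatio units. For an n-value IN
    // predicate none of whose values are in the segment, the prune succeeds
    // with probability (1 - fpp)^n. Pruning is worthwhile while the expected
    // saving (1 - fpp)^n * costRatio is at least the cost of consulting the
    // filter, i.e. while (1 - fpp)^n * costRatio >= 1.
    static int deriveThreshold(double fpp, double costRatio) {
        // Largest integer n satisfying (1 - fpp)^n * costRatio >= 1.
        return (int) Math.floor(Math.log(costRatio) / -Math.log(1.0 - fpp));
    }

    public static void main(String[] args) {
        // With fpp = 0.05 and a 100x scan-to-check cost ratio, the break-even
        // threshold lands at roughly 90, matching the figure above.
        System.out.println(deriveThreshold(0.05, 100.0));
    }
}
```

Under these assumptions the threshold scales with the logarithm of the cost ratio, so even a conservative estimate of the scan cost pushes it well past 10.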

In addition, MinMaxRange pruning does not have the false-positive problem that the Bloom Filter does, so perhaps we should always enable MinMaxRange pruning.

cc @mcvsubbu @siddharthteotia @GSharayu

Jackie-Jiang commented 2 years ago

The problem here is that:

- If a segment cannot be pruned, we pay the overhead of parsing all the values in the IN clause and applying each of them to the bloom filter (assuming comparing against the min/max values is very cheap).
- If a segment can be pruned, then even if we skip the pruning because there are too many values in the IN clause, the segment will still be quickly skipped after looking up the values in the dictionary and finding there is no match.

So I don't think the assumption of a full table scan holds here.
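The dictionary argument can be illustrated with a simplified sketch (assuming a dictionary-encoded column whose sorted dictionary supports binary search; not actual Pinot internals):

```java
import java.util.Arrays;

public class DictionarySkipSketch {
    // Simplified sketch: for a dictionary-encoded column, checking whether
    // any IN value exists in the segment is a binary search per value over
    // the sorted dictionary. A segment containing none of the values is
    // discovered (and skipped) without ever scanning the column data.
    static boolean segmentMayMatch(int[] sortedDictionary, int[] inValues) {
        for (int v : inValues) {
            if (Arrays.binarySearch(sortedDictionary, v) >= 0) {
                return true; // at least one value present; evaluate further
            }
        }
        return false; // no IN value in the dictionary: segment skipped cheaply
    }

    public static void main(String[] args) {
        int[] dictionary = {3, 7, 12, 25, 40}; // toy sorted dictionary
        System.out.println(segmentMayMatch(dictionary, new int[]{1, 2, 8})); // false
        System.out.println(segmentMayMatch(dictionary, new int[]{1, 25}));   // true
    }
}
```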