This PR proposes an alternative GCS filter querying mechanism, intended to perform better once the number of query entries exceeds the number of elements in the filter.
As the number of elements in the query grows, allocating and sorting them begins to dominate the runtime. For large queries, the solution is inspired by a hash join, which makes no assumptions about the input ordering of either set. Since the number of filter entries is ultimately bounded by the block size, the filter entries are chosen as the hash index so that setup latency is minimized.
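To illustrate the difference between the two strategies, here is a minimal Python sketch. It is not the code in this PR: it assumes the filter has already been decoded into a sorted list of integers and that the query entries have already been hashed into the same range. The function names and the Q > F switch in match_any_hybrid are hypothetical and chosen only for illustration.

```python
from typing import List


def match_any_zip(filter_values: List[int], query_values: List[int]) -> bool:
    """Sort-and-zip: sort the hashed queries, then walk both sorted lists."""
    queries = sorted(query_values)  # allocation plus O(Q log Q) sort
    i = j = 0
    while i < len(filter_values) and j < len(queries):
        if filter_values[i] == queries[j]:
            return True
        if filter_values[i] < queries[j]:
            i += 1  # filter value is too small, advance the filter side
        else:
            j += 1  # query value is too small, advance the query side
    return False


def match_any_hash_join(filter_values: List[int], query_values: List[int]) -> bool:
    """Hash join: index the (smaller) filter, then probe with each query."""
    index = set(filter_values)  # setup: one pass over the F filter values
    # Online: one expected O(1) probe per query value, with no sorting
    # and no assumption about the order of the queries.
    return any(q in index for q in query_values)


def match_any_hybrid(filter_values: List[int], query_values: List[int]) -> bool:
    """Hypothetical switch: hash join when Q > F, zip otherwise."""
    if len(query_values) > len(filter_values):
        return match_any_hash_join(filter_values, query_values)
    return match_any_zip(filter_values, query_values)
```

Building the index from the filter side keeps the setup cost proportional to the filter size rather than to the (larger) query, which is the motivation given above for choosing the filter entries as the hash index.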
Complexity
Number of filter entries: F
Number of query entries: Q
Assumption: Q > F
Setup
Online
Benchmarks
Zip w/ 5K Filter Elements
Zip w/ 10K Filter Elements
Hash-Join w/ 5K Filter Elements
Hash-Join w/ 10K Filter Elements
Hybrid w/ 5K Filter Elements
Hybrid w/ 10K Filter Elements
Ratio Zip/Hash