datafuselabs / databend

𝗗𝗮𝘁𝗮, 𝗔𝗻𝗮𝗹𝘆𝘁𝗶𝗰𝘀 & 𝗔𝗜. Modern alternative to Snowflake. Cost-effective and simple for massive-scale analytics. https://databend.com
https://docs.databend.com

Feature: Try Support BloomFilter Collision #14928

Open JackTan25 opened 5 months ago

JackTan25 commented 5 months ago

Summary: We can check the runtime filter's bloom filter against each Parquet block's bloom filter index (a bloom filter "collision") to prune blocks at the storage level. This gives us more chances to filter out data when reading Parquet.
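A minimal sketch of the idea, not Databend's implementation: it assumes both filters are built with the same bit-array size and the same hash functions, which the issue does not state and which would need to hold (or be arranged) for a bitwise comparison to be meaningful. The names `BloomFilter` and `can_prune` are hypothetical. If the two bit arrays share no set bit, no key can be present in both, so the block can be skipped.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter; the bit array is stored in a Python int."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, key):
        # Derive num_hashes bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


def can_prune(runtime_filter, block_filter):
    """Return True only when the block provably holds no runtime-filter key."""
    same_layout = (
        runtime_filter.num_bits == block_filter.num_bits
        and runtime_filter.num_hashes == block_filter.num_hashes
    )
    if not same_layout:
        # Assumption violated: the bit arrays are not comparable, keep the block.
        return False
    # A shared key would set the same bits in both filters, so an empty
    # intersection means the two key sets are disjoint.
    return (runtime_filter.bits & block_filter.bits) == 0


# Usage: keys from the join build side vs. keys stored in one block.
runtime_filter = BloomFilter()
for key in (1, 2, 3):
    runtime_filter.add(key)

block_filter = BloomFilter()
for key in (3, 4):
    block_filter.add(key)

print(can_prune(runtime_filter, block_filter))  # False: key 3 is in both
```

The check is conservative: it never prunes a block that contains a matching key, but once either filter is densely populated the intersection is rarely empty, so few blocks get pruned.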

JackTan25 commented 5 months ago

cc @dantengsky

JackTan25 commented 5 months ago

Reference: https://openproceedings.org/2023/conf/edbt/paper-190.pdf. While working on https://github.com/datafuselabs/databend/pull/14970, we found that in some cases the false-positive rate is very high, so we can't prune blocks as expected. We could introduce bloomRF to solve this; it is newer than the SuRF paper.