apache / pinot

Apache Pinot - A realtime distributed OLAP datastore
https://pinot.apache.org/
Apache License 2.0
5.26k stars 1.23k forks source link

Add HLL++ support for better accuracy and possibly lower memory cost #7014

Open lakshmanan-v opened 3 years ago

lakshmanan-v commented 3 years ago

DISTINCTCOUNTHLL accuracy and memory footprint can be improved through latest HLL algorithms. We have a choice either replace the existing implementation with a better one or leave the existing DISTINCTCOUNTHLL to implement original HLL and create separate functions (ex: DISTINCTCOUNTHLLPLUSPLUS).

Google's HLL++ -- a popular algorithm amongst the community offers lot of improvements over original HLL. There are multiple java implementations of HLL++. Most of them have variations in performance due to the register size and other implementation choices. Clearspring stream-lib used for current HyperLogLog function, implements HLL++ as HyperLogPlus.

Jackie-Jiang commented 3 years ago

Another alternative is the Data Sketch HLL: https://datasketches.apache.org/docs/HLL/HLL.html They claim better performance in this paper: https://datasketches.apache.org/docs/HLL/Hll_vs_CS_Hllpp.html

lakshmanan-v commented 3 years ago

Good point @Jackie-Jiang. We could add this as a separate function DISTINCTCOUNDATASKETCHHLL (similar to DISTINCTCOUNTTHETASKETCH). It would be nice to document the parameters, accuracy and space requirements of each of HLL implementations as we have a handful of them already. Users can chose the right one based on their need.