Open lakshmanan-v opened 3 years ago
Another alternative is the Data Sketch HLL: https://datasketches.apache.org/docs/HLL/HLL.html They claim better performance in this paper: https://datasketches.apache.org/docs/HLL/Hll_vs_CS_Hllpp.html
Good point @Jackie-Jiang. We could add this as a separate function DISTINCTCOUNDATASKETCHHLL (similar to DISTINCTCOUNTTHETASKETCH). It would be nice to document the parameters, accuracy and space requirements of each of HLL implementations as we have a handful of them already. Users can chose the right one based on their need.
DISTINCTCOUNTHLL accuracy and memory footprint can be improved through latest HLL algorithms. We have a choice either replace the existing implementation with a better one or leave the existing DISTINCTCOUNTHLL to implement original HLL and create separate functions (ex: DISTINCTCOUNTHLLPLUSPLUS).
Google's HLL++ -- a popular algorithm amongst the community offers lot of improvements over original HLL. There are multiple java implementations of HLL++. Most of them have variations in performance due to the register size and other implementation choices. Clearspring stream-lib used for current HyperLogLog function, implements HLL++ as HyperLogPlus.