alecmocatta / streaming_algorithms

Performant implementations of various streaming algorithms, including Count–min sketch, Top k, HyperLogLog, Reservoir sampling.
Apache License 2.0
85 stars 11 forks source link

better HLL estimator #17

Open jianshu93 opened 11 months ago

jianshu93 commented 11 months ago

Hello Team,

It seems that new cardinality estimator paper has two new estimators, the improved estimator and MLE estimator, better than the one originally proposed, see paper here: https://arxiv.org/pdf/1706.07290.pdf

Just wondering whether it worth trying, C++ version can be found in the paper.

Thank you,

Jianshu

LucaCappelletti94 commented 3 months ago

@jianshu93 FYI, I implemented that estimator in my hyperloglog - it is still slow AF and only statistically better for what entails the union estimation. Maybe with better optimizers, one may get better performance, but when there is a need to compute logarithms in a core loop things can only go that fast.

jianshu93 commented 3 months ago

Hi @LucaCappelletti94,

Thanks for letting me know! Did you benchmarked against the MLE and improved estimator in Ertl's paper, as implemented in SourMash (https://docs.rs/sourmash/0.15.0/sourmash/sketch/hyperloglog/estimators/index.html)? This is a direct import of the C++ version from the original paper. It would be nice to have a benchmark and I would be happy to play with your implementation.

Best,

Jianshu

LucaCappelletti94 commented 3 months ago

I am running several benchmarks, I can add also that one in. Thanks for pointing that one out.

jianshu93 commented 3 months ago

And also this one, hypertwobits(https://github.com/axiomhq/hypertwobits/tree/main), a very new one, paper was published just last month. I am a little bit concerned about it since the benchmarks showed very large variation even than the original hyperloglog. Let me know your thoughts on it also.

Jianshu

LucaCappelletti94 commented 3 months ago

Hi @LucaCappelletti94,

Thanks for letting me know! Did you benchmarked against the MLE and improved estimator in Ertl's paper, as implemented in SourMash (https://docs.rs/sourmash/0.15.0/sourmash/sketch/hyperloglog/estimators/index.html)? This is a direct import of the C++ version from the original paper. It would be nice to have a benchmark and I would be happy to play with your implementation.

Best,

Jianshu

FYI, I have just read their joint estimation algo and it does not match the one from Otmar. I am quite sure of the correctness of my implementation since I bothered Otmar several times to check it together, so the only thing that remains to verify is which one is more accurate and their speeds.

LucaCappelletti94 commented 3 months ago

And also this one, hypertwobits(https://github.com/axiomhq/hypertwobits/tree/main), a very new one, paper was published just last month. I am a little bit concerned about it since the benchmarks showed very large variation even than the original hyperloglog. Let me know your thoughts on it also.

Jianshu

Yeah, those results seem sus, especially for a 2-bit register thinghy. We'll see in a bit, currently adding to lineup.

jianshu93 commented 3 months ago

I would also be interested in UltraLogLog (https://dl.acm.org/doi/abs/10.14778/3654621.3654632) and ExaLogLog (https://arxiv.org/abs/2402.13726). But no implementation is available in Rust. All Ertl's implementation is in the hash4j java library, which I do not like. Since you have so many benchmarks there, I see it as a paper describing and available HLL-like implementations.

Jianshu

LucaCappelletti94 commented 3 months ago

I would also be interested in UltraLogLog (https://dl.acm.org/doi/abs/10.14778/3654621.3654632) and ExaLogLog (https://arxiv.org/abs/2402.13726). But no implementation is available in Rust. All Ertl's implementation is in the hash4j java library, which I do not like.

Jianshu

Samesies on Java. A bit hard to do a fair comparison there and I do not have the time to re-implement it, at least, not now.