addthis / stream-lib

Stream summarizer and cardinality estimator.
Apache License 2.0
2.26k stars 556 forks source link

Typos, documentation and parameters #64

Open s-j opened 10 years ago

s-j commented 10 years ago

Hi guys,

thanks for such a nice collection of algorithms, I've been playing a bit with and stumbled over a few things I would like to clarify:

final int r = Long.numberOfLeadingZeros((hashedValue << this.log2m) | (1 << (this.log2m - 1)) + 1;

Thanks and cheers, Simon

abramsm commented 10 years ago

Simon -

Thanks!

final int r = Long.numberOfLeadingZeros((hashedValue << this.log2m) | (1 << (this.log2m - 1)) + 1;

MA - this is a little bit tricky but the code is correct as it is written currently. We are adding 1 to the function retreiving the number of leading zeros and then adding 1 to the number returned from that function.

TestLogLog uses 1.04 as error estimate parameter (beta), while other sources, e.g. http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining and the original Falojet's papers, uses 1.30. What is correct?

MA - 1.30 is correct for LogLog, HyperLogLog improved the value to 1.04.

Furthermore, TestLogLog asserts the estimated value to be within 2 se from the actual value, while TestHyperLogLog checks if it is within 3 \ se? Is there any good reason for this?

MA - you are correct, it should not be using the 2x, however the tests never passed with that tolerance while using the default hash function. The guava128 offered hash provides better results. This test is ignored and has some other issues. Bottom line is LogLog should probably be deprecated and removed from future releases given that HLLP is far superior in performance and size.

HyperLogLog:94,106 mentions 1.106 vs 1.04, which is correct? (Googling 1.106 led me to the beta-32-bound given in the Flajolet's 2007 paper, while 1.04 corresponds to beta-infinity.)

MA - I am not sure. I cannot recall why I choose 1.106. All tests seem to pass with both values and I have not done a more detailed analysis on which provides better results. Flajolet's 2007 paper provides these beta values:

b_16=1.106, b_32=1.070, b_64=1.054, b_128=1.04, b_infinity=1.039

Do you have thoughts on what the correct value is? I think, at this point, my recommendation for new implementations would be to use HLLP rather than HLL.

CountThenEstimate:217 -- perhaps it should be LogLog(bytes) instead of LinearCounting(bytes)?

MA - yes, good catch!

Thanks, Matt

On Fri, Jan 31, 2014 at 9:20 AM, Simon Jonassen notifications@github.com wrote:

Hi guys,

thanks for such a nice collection of algorithms, I've been playing a bit with and stumbled over a few things I would like to clarify:

HyperLogLog:151,161 seems to have a redundant "+ 1", perhaps appeared on copy-paste from LogLog, suggested correction:

final int r = Long.numberOfLeadingZeros((hashedValue << this.log2m) | (1 << (this.log2m - 1)) + 1;

TestLogLog uses 1.04 as error estimate parameter (beta), while other sources, e.g. http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining and the original Falojet's papers, uses 1.30. What is correct? Furthermore, TestLogLog asserts the estimated value to be within 2 * se from the actual value, while TestHyperLogLog checks if it is within 3 * se? Is there any good reason for this? HyperLogLog:94,106 mentions 1.106 vs 1.04, which is correct? (Googling 1.106 led me to the beta-32-bound given in the Flajolet's 2007 paper, while 1.04 corresponds to beta-infinity.) CountThenEstimate:217 -- perhaps it should be LogLog(bytes) instead of LinearCounting(bytes)?

Thanks and cheers, Simon

Reply to this email directly or view it on GitHub.