addthis / stream-lib

Stream summarizer and cardinality estimator.
Apache License 2.0
2.26k stars 558 forks source link

Argument checks in HyperLogLogPlus constructor need to be more restrictive #95

Open oertl opened 9 years ago

oertl commented 9 years ago

The HyperLogLogPlus implementation only supports parameters p and sp for which 4 <= p <= sp <= 25. However, there is only a check for 4 <= p <= sp <= 32. Here is some test code with sp=26 that demonstrates the problem:

        HyperLogLogPlus hll = new HyperLogLogPlus(25, 26);

        hll.offerHashed(0X7FFFFF8000000000L);
        hll.offerHashed(0XFFFFFF8000000000L); 

        assertEquals(2, hll.cardinality()); // fails, hll.cardinality() returns 1 

The expected cardinality is 2, because the index (first 25 bits) as well as the sparse index (first 26 bits) are different.

szhem commented 7 years ago

Hello, I was able to reproduce this issue too with the following code snippet in scala

object Main {      
  def main(args: Array[String]) {
    (1 to 5).foreach(_ => run(10 * 1000 * 1000L, 20, 25))
    println()
    (1 to 5).foreach(_ => run(10 * 1000 * 1000L, 20, 32))
  }

  def run(size: Long, p: Int, sp: Int): Unit = {
    val streamlib = new HyperLogLogPlus(p, sp)

    var i: Long = 0L
    while (i < size) {
      val uuid = UUID.randomUUID().toString
      streamlib.offer(uuid)
      i += 1
    }

    printf("p: %s, sp: %s -- exact: %s, estimated: %s, error: %f%%%n",
      p, sp, size, streamlib.cardinality(), 
      100 * Math.abs((streamlib.cardinality() - size) / size.toFloat)
    )
  }
}

... and got the following results

p: 20, sp: 25 -- exact: 10000000, estimated: 9988177, error: 0.118230%
p: 20, sp: 25 -- exact: 10000000, estimated: 10001031, error: 0.010310%
p: 20, sp: 25 -- exact: 10000000, estimated: 9998517, error: 0.014830%
p: 20, sp: 25 -- exact: 10000000, estimated: 9993139, error: 0.068610%
p: 20, sp: 25 -- exact: 10000000, estimated: 10009278, error: 0.092780%

p: 20, sp: 32 -- exact: 10000000, estimated: 9924770, error: 0.752300%
p: 20, sp: 32 -- exact: 10000000, estimated: 9920361, error: 0.796390%
p: 20, sp: 32 -- exact: 10000000, estimated: 9932979, error: 0.670210%
p: 20, sp: 32 -- exact: 10000000, estimated: 9917889, error: 0.821110%
p: 20, sp: 32 -- exact: 10000000, estimated: 9916722, error: 0.832780%

If we try to estimate the error rate boundaries for p=20 we should get 3 * 1.04 / sqrt (2^20) which is 0.003046875 or 0.3046875%, but error rate for p=20,sp=32 is much higher.

Given the described behaviour, should the implementation be patched to allow only the following input values 4 <= p <= sp <= 25 as @oertl suggested?

If so, I can submit a patch to fix it.