Open oertl opened 9 years ago
Hello, I was able to reproduce this issue too with the following code snippet in scala
object Main {
def main(args: Array[String]) {
(1 to 5).foreach(_ => run(10 * 1000 * 1000L, 20, 25))
println()
(1 to 5).foreach(_ => run(10 * 1000 * 1000L, 20, 32))
}
def run(size: Long, p: Int, sp: Int): Unit = {
val streamlib = new HyperLogLogPlus(p, sp)
var i: Long = 0L
while (i < size) {
val uuid = UUID.randomUUID().toString
streamlib.offer(uuid)
i += 1
}
printf("p: %s, sp: %s -- exact: %s, estimated: %s, error: %f%%%n",
p, sp, size, streamlib.cardinality(),
100 * Math.abs((streamlib.cardinality() - size) / size.toFloat)
)
}
}
... and got the following results
p: 20, sp: 25 -- exact: 10000000, estimated: 9988177, error: 0.118230%
p: 20, sp: 25 -- exact: 10000000, estimated: 10001031, error: 0.010310%
p: 20, sp: 25 -- exact: 10000000, estimated: 9998517, error: 0.014830%
p: 20, sp: 25 -- exact: 10000000, estimated: 9993139, error: 0.068610%
p: 20, sp: 25 -- exact: 10000000, estimated: 10009278, error: 0.092780%
p: 20, sp: 32 -- exact: 10000000, estimated: 9924770, error: 0.752300%
p: 20, sp: 32 -- exact: 10000000, estimated: 9920361, error: 0.796390%
p: 20, sp: 32 -- exact: 10000000, estimated: 9932979, error: 0.670210%
p: 20, sp: 32 -- exact: 10000000, estimated: 9917889, error: 0.821110%
p: 20, sp: 32 -- exact: 10000000, estimated: 9916722, error: 0.832780%
If we try to estimate the error rate boundaries for p=20
we should get 3 * 1.04 / sqrt (2^20)
which is 0.003046875
or 0.3046875%
, but error rate for p=20,sp=32
is much higher.
Given the described behaviour, should the implementation be patched to allow only the following input values 4 <= p <= sp <= 25
as @oertl suggested?
If so, I can submit a patch to fix it.
The HyperLogLogPlus implementation only supports parameters p and sp for which 4 <= p <= sp <= 25. However, there is only a check for 4 <= p <= sp <= 32. Here is some test code with sp=26 that demonstrates the problem:
The expected cardinality is 2, because the index (first 25 bits) as well as the sparse index (first 26 bits) are different.