aggregateknowledge / java-hll

Java library for the HyperLogLog algorithm
http://research.neustar.biz/2013/12/24/open-source-release-java-hll/
Apache License 2.0

Problem with toBytes and fromBytes (error is 76%, too big) #9

Closed ngocdaothanh closed 10 years ago

ngocdaothanh commented 10 years ago

test1 (Scala code) gives an error of less than 2%:

import java.nio.charset.Charset
import com.google.common.hash.Hashing
import net.agkn.hll.HLL

def test1() {
  val charset = Charset.forName("UTF8")
  val m3 = Hashing.murmur3_128
  val hll = new HLL(13, 5)

  for (i <- 1 to 74570) {
    val dau1 = hll.cardinality.toInt

    val hashCode = m3.hashString("" + i, charset)
    hll.addRaw(hashCode.asLong)

    val dau2 = hll.cardinality.toInt
    if (dau1 == dau2) println(s"dau1 = dau2 = $dau1, i = $i")
  }
  println(hll.cardinality.toInt)
}

test2 gives an error of 76%!

def test2() {
  val charset = Charset.forName("UTF8")
  val m3 = Hashing.murmur3_128  

  var b: Array[Byte] = null
  for (i <- 1 to 74570) {
    val hll = if (b == null) new HLL(13, 5) else HLL.fromBytes(b)
    val dau1 = hll.cardinality.toInt

    val hashCode = m3.hashString("" + i, charset)
    hll.addRaw(hashCode.asLong)

    val dau2 = hll.cardinality.toInt
    if (dau1 == dau2) println(s"dau1 = dau2 = $dau1, i = $i")

    if (dau1 != dau2) b = hll.toBytes
  }

  val hll = if (b == null) new HLL(13, 5) else HLL.fromBytes(b)
  println(hll.cardinality.toInt)
}

The output looks like this (note that the value drops dramatically from 73818 to 17757):

...
dau1 = dau2 = 73818, i = 74560
dau1 = dau2 = 73818, i = 74561
dau1 = dau2 = 73818, i = 74562
dau1 = dau2 = 17757, i = 74565
dau1 = dau2 = 17760, i = 74567
...
yerenkow commented 10 years ago

I think you have a bug here:

... if (dau1 != dau2) b = hll.toBytes ...

If iteration X ends with dau1 == dau2, then b is not updated and you are stuck with the snapshot from iteration X-1. That can easily repeat for many iterations (X+1, X+2, ..., X+N) until some value finally makes dau1 != dau2, and every value added in between is lost the next time you reload from b.
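The failure mode can be shown without the HLL library at all. The sketch below (all names are made up for illustration) uses an exact set whose reported count is quantized, mimicking an estimator where adding a fresh value does not always move the visible cardinality, and compares the "save only on change" pattern from test2 against saving unconditionally:

```java
import java.util.*;

public class StaleSnapshot {
    // Stand-in "sketch": an exact set whose reported count is rounded down
    // to a multiple of 10, so an add does not always change the visible count
    // (just like a probabilistic estimator).
    static int reported(Set<Integer> s) { return (s.size() / 10) * 10; }

    public static void main(String[] args) {
        // Buggy pattern from test2: only persist when the count moved.
        Set<Integer> snapshot = new HashSet<>();
        for (int i = 1; i <= 1000; i++) {
            Set<Integer> s = new HashSet<>(snapshot); // "HLL.fromBytes(b)"
            int before = reported(s);
            s.add(i);
            int after = reported(s);
            if (before != after) snapshot = s;        // "b = hll.toBytes"
        }
        // The reported count never moves off 0, so nothing is ever saved.
        System.out.println("buggy: " + snapshot.size());

        // Fixed pattern: persist unconditionally after every add.
        snapshot = new HashSet<>();
        for (int i = 1; i <= 1000; i++) {
            Set<Integer> s = new HashSet<>(snapshot);
            s.add(i);
            snapshot = s;                             // always save
        }
        System.out.println("fixed: " + snapshot.size());
    }
}
```

Run as written, the buggy loop loses all 1000 values while the fixed loop keeps them, which is the same effect as the drop from 73818 to 17757 in the output above.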

If you look at how the cardinality is calculated, you'll see why. There are three ways of computing the estimate (I'm talking about sparse/full mode). While the storage is not saturated, the estimate is computed with method 1; after you have added quite a few values it switches to method 2; and when the storage is filled to some high level there is method 3. Because the estimate is probabilistic, you can add a value to the storage and be told that the cardinality has not changed. The estimate can also drop at the switch between methods 1 and 2, and between methods 2 and 3: method 1 may give you, say, 1215, and adding one more value may trigger the internal switch to method 2, which may give you 1213.

So there is no guarantee that the cardinality grows after every add, even when the value really is fresh and previously unseen. If you adjust your logic to account for that, you can write a better error estimator.

ngocdaothanh commented 10 years ago

Thanks for the explanation. I noticed that too, but I thought the cardinality should not change dramatically for any input, so that the error stays below "a few percent".

So is this normal behavior?

ngocdaothanh commented 10 years ago

I've reread the README; it describes several algorithms: https://github.com/aggregateknowledge/java-hll#algorithms

How can I use the EXPLICIT algorithm? I'm counting DAU (daily active users), and I only need to support up to a certain limit of DAU.

yerenkow commented 10 years ago

Explicit is used by default until your storage fills up (its size depends on log2m and regWidth). The logic is simple: it's just an array of ids. For that approach you don't actually need an HLL. If you still want to use it, tune log2m and regWidth.

Explicit, sparse and full are methods of storing values, and for sparse and full there are three slightly different methods of calculating the cardinality. In explicit mode the cardinality is simply the number of values collected in the array.
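That description can be sketched in a few lines (names and the promotion threshold are made up for illustration; this is not the library's code): an explicit sketch is just a set of raw hashed values, its "cardinality" is exactly the set's size, and only once the set outgrows a threshold is it worth promoting to a probabilistic (sparse/full) encoding.

```java
import java.util.*;

public class ExplicitSketch {
    final int threshold;                        // assumed promotion cut-off
    final Set<Long> values = new HashSet<>();
    boolean promoted = false;

    ExplicitSketch(int threshold) { this.threshold = threshold; }

    void addRaw(long hash) {
        values.add(hash);                       // exact storage: just the raw values
        if (values.size() > threshold)
            promoted = true;                    // real code would re-encode as sparse/full
    }

    long cardinality() { return values.size(); } // exact while still explicit

    public static void main(String[] args) {
        ExplicitSketch s = new ExplicitSketch(256);
        for (long i = 1; i <= 100; i++) { s.addRaw(i); s.addRaw(i); } // duplicates collapse
        System.out.println(s.cardinality() + " " + s.promoted);
    }
}
```

As long as the count stays under the threshold the answer is exact with zero error, which matches the point above that for small bounded counts a plain set does the job without an HLL.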

blinsay commented 10 years ago

@ngocdaothanh what version of hll are you using? There was an issue with fromBytes/toBytes that was fixed in 1.5.1 (check out de5cc2d).

ngocdaothanh commented 10 years ago

I'm using 1.5.0. You should update the README because it still says that the latest version is 1.5.0, not 1.5.1. (With 1.5.1, test2 gives the same result.)

After reading the tests at: https://github.com/aggregateknowledge/java-hll/tree/master/src/test/java/net/agkn/hll I see that there are some variants of the HLL constructor. I think you should mention them in the README too.

Thanks for the explanations. I guess this issue can be closed now.

blinsay commented 10 years ago

The README is updated to show 1.5.1 as the latest version. Sorry about the confusion!