Closed ngocdaothanh closed 10 years ago
I think you probably have a bug here:
... if (dau1 != dau2) b = hll.toBytes ...
If in iteration X you get dau1 == dau2, then you're stuck with the b from iteration (X-1), and you can easily reiterate many times (X+1, X+2, ..., X+N) until you get some value that makes dau1 != dau2.
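The failure mode can be sketched with a toy estimator (this is a stand-in class, not the java-hll API; the names dau1, dau2, b, and hll mirror the snippet above):

```java
// Toy estimator whose reported cardinality can stay flat after an add,
// mimicking a probabilistic counter. A stand-in only, NOT the java-hll API.
class ToyEstimator {
    private long n = 0;
    void add(long value) { n++; }
    long cardinality() { return (n / 2) * 2; } // coarse estimate: flat on every odd n
    byte[] toBytes() { return Long.toString(n).getBytes(); }
}

public class SkipBug {
    // Returns the contents of b after the loop, to show it can go stale.
    static String run() {
        ToyEstimator hll = new ToyEstimator();
        byte[] b = hll.toBytes();
        long dau1 = hll.cardinality();
        for (long id = 1; id <= 5; id++) {
            hll.add(id);
            long dau2 = hll.cardinality();
            if (dau1 != dau2) b = hll.toBytes(); // buggy: skips serialization when the estimate is flat
            dau1 = dau2;
        }
        return new String(b);
    }

    public static void main(String[] args) {
        // Five values were added, but b was last refreshed when the estimate
        // moved at the 4th add, so it holds stale state.
        System.out.println(run()); // prints "4"
    }
}
```

Serializing unconditionally after each add (or after the loop) avoids holding stale bytes.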
I think if you understand more about cardinality calculation, you'll see why. There are three ways of calculating cardinality (I'm talking about Sparse/Full mode). While the storage is not saturated, the estimate is returned using method 1. After you fill in quite a few values, method 2 kicks in, and when your storage is filled up to some high level, there's method 3. Because of the probabilistic errors in the cardinality estimate, you can add a value to the storage and have it tell you that the cardinality did not change. Also, a drop in cardinality can happen between methods 1 & 2 and between methods 2 & 3. What I'm saying is: while method 1 gives you, for example, 1215, after adding a new value the internal method could switch to method 2, which can give you 1213.
So there's no guarantee that your cardinality grows after each added value (even if you really are adding a fresh, unseen value). If you fix your logic according to that, you could write a different error estimator.
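Given that, one way to check error is to compare the estimate against the true count with a relative tolerance, rather than expecting the estimate to grow after every add. A minimal sketch of that idea (my formulation, not library code), using the numbers from this thread:

```java
public class ErrorCheck {
    // Relative-error check: tolerant of the small dips that can occur when the
    // internal cardinality method switches, but catches genuinely bad estimates.
    static boolean withinTolerance(long estimate, long trueCount, double tol) {
        if (trueCount == 0) return estimate == 0;
        double relErr = Math.abs(estimate - trueCount) / (double) trueCount;
        return relErr <= tol;
    }

    public static void main(String[] args) {
        // The method-switch dip from 1215 to 1213 is well within 2%...
        System.out.println(withinTolerance(1213, 1215, 0.02)); // true
        // ...while a drop from 73818 to 17757 (roughly 76% error) is not.
        System.out.println(withinTolerance(17757, 73818, 0.02)); // false
    }
}
```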
Thanks for the explanation. I noticed that too, but I thought that the cardinality should not change dramatically for any input, to keep the error below "a few percent".
So is this normal behavior?
I've reread the README and it says about several algorithms: https://github.com/aggregateknowledge/java-hll#algorithms
How can I use the EXPLICIT algorithm? I'm counting DAU (daily active users), and I only need to support up to a certain limit of DAU.
Explicit is used by default until you fill up your storage (whose size depends on log2m and regWidth). The logic is simple: it's just an array of ids. For such an approach you don't actually need HLL. If you still want to use it, tune log2m and regWidth.
Explicit, Sparse and Full are methods of storing values, and for Sparse and Full there are three slightly different methods of calculating cardinality. For Explicit mode, cardinality = the number of collected values in the array.
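In other words, below the promotion threshold the Explicit representation is equivalent to keeping the raw hashed values in a set. A minimal sketch of that bookkeeping (a hypothetical class, not the library's internals; in the real library the threshold is derived from log2m and regWidth):

```java
import java.util.HashSet;
import java.util.Set;

// Explicit mode in miniature: store hashed values in a set and report the
// exact size. The real library promotes to Sparse/Full once the explicit
// storage outgrows its threshold.
public class ExplicitSketch {
    private final Set<Long> storage = new HashSet<>();
    private final int threshold; // stand-in for the size derived from log2m/regWidth

    ExplicitSketch(int threshold) { this.threshold = threshold; }

    // Returns false when a fresh value would overflow the explicit storage,
    // i.e. the point where the real library switches representations.
    boolean addRaw(long hashedValue) {
        if (storage.size() >= threshold && !storage.contains(hashedValue)) return false;
        storage.add(hashedValue);
        return true;
    }

    long cardinality() { return storage.size(); } // exact while explicit

    public static void main(String[] args) {
        ExplicitSketch s = new ExplicitSketch(3);
        s.addRaw(1); s.addRaw(2); s.addRaw(2); s.addRaw(3);
        System.out.println(s.cardinality()); // prints 3 (duplicates don't count)
        System.out.println(s.addRaw(4));     // prints false (would need promotion)
    }
}
```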
@ngocdaothanh what version of hll are you using? There was an issue with fromBytes/toBytes that was fixed in 1.5.1 (check out de5cc2d).
I'm using 1.5.0. You should update the README because it still says that the latest version is 1.5.0, not 1.5.1. (With 1.5.1, test2 gives the same result.)
After reading the tests at: https://github.com/aggregateknowledge/java-hll/tree/master/src/test/java/net/agkn/hll I see that there are some variants of the HLL constructor. I think you should mention them in the README too.
Thanks for the explanations. I guess this issue can be closed now.
The README is updated to show 1.5.1 as the latest version. Sorry about the confusion!
test1 (Scala code) gives an error of less than 2%:
test2 gives an error of 76%!!!
The output looks like this (note that the value drops dramatically from 73818 to 17757):