addthis / stream-lib

Stream summarizer and cardinality estimator.
Apache License 2.0
2.26k stars 556 forks source link

Is there any way to let hll++ cardinality increasing? #122

Closed mtunique closed 7 years ago

mtunique commented 7 years ago

Is there any way to let hll++ cardinality bigger than before offer one new value?

abramsm commented 7 years ago

I do not follow the question. Can you provide an example?

On Dec 26, 2016 5:36 AM, "Tao Meng" notifications@github.com wrote:

Is there any way to let hll++ cardinality bigger than before offer one new value?

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/addthis/stream-lib/issues/122, or mute the thread https://github.com/notifications/unsubscribe-auth/AABfa8DHgF3217dcQuPZBCwCSlgXkkuIks5rL7TqgaJpZM4LVwfj .

mtunique commented 7 years ago
        HyperLogLogPlus hyperLogLogPlus = new HyperLogLogPlus(16);
        for(int i = 0; i < 400000; i++) {
            long pre = hyperLogLogPlus.cardinality();
            hyperLogLogPlus.offer(i);
            long now = hyperLogLogPlus.cardinality();
            if (now < pre) {
                System.out.printf("%d %d\n", pre, now);
            }
        }

output:

327858 327682

Thanks.

abramsm commented 7 years ago

I'll look at code later but I suspect this is happening when the data structure transitions to sparse mode. Does it only happen once?

On Dec 26, 2016 5:41 AM, "Tao Meng" notifications@github.com wrote:

    HyperLogLogPlus hyperLogLogPlus = new HyperLogLogPlus(16);
    for(int i = 0; i < 400000; i++) {
        long pre = hyperLogLogPlus.cardinality();
        hyperLogLogPlus.offer(i);
        long now = hyperLogLogPlus.cardinality();
        if (now < pre) {
            System.out.printf("%d %d\n", pre, now);
            hyperLogLogPlus.cardinality();
        }
    }

output:

327858 327682

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/addthis/stream-lib/issues/122#issuecomment-269208000, or mute the thread https://github.com/notifications/unsubscribe-auth/AABfa0iRszFpWum8fz1wOhitC_qYRRTWks5rL7X-gaJpZM4LVwfj .

mtunique commented 7 years ago

Anytime and always these values 327858 327682.

abramsm commented 7 years ago

This is just a quirk of the probabilistic calculation. The register sum changes from

9454.113

to

9454.008

when you offer that additional on when i = 329514. See line 506 of HyperLogLogPlus to see the code i'm talking about. If you set a breakpoint for i=329514 you can step through and see what happens.

On a side note calling cardinality that often is very slow (not recommended for production use case).

Matt

On Mon, Dec 26, 2016 at 5:48 AM, Tao Meng notifications@github.com wrote:

Anytime and always these values 327858 327682.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/addthis/stream-lib/issues/122#issuecomment-269208546, or mute the thread https://github.com/notifications/unsubscribe-auth/AABfa3sfIQG_an9nU_w1SdqanrrMFwu1ks5rL7e2gaJpZM4LVwfj .

mtunique commented 7 years ago

Thanks @abramsm .

Do you have some trick methods to let cardinality bigger than before offer one new value ?

abramsm commented 7 years ago

I don't understand the question. Please try to explain what you are looking for in more detail.

On Mon, Dec 26, 2016 at 7:34 PM, Tao Meng notifications@github.com wrote:

Thanks @abramsm https://github.com/abramsm .

Do you have some trick methods to let cardinality bigger than before offer one new value ?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/addthis/stream-lib/issues/122#issuecomment-269260124, or mute the thread https://github.com/notifications/unsubscribe-auth/AABfa0UORAEkfmDu2d5YDXOQpp0M0a83ks5rMHlIgaJpZM4LVwfj .

oertl commented 7 years ago

The current implementation does not guarantee monotonicity. The cardinality estimate might become smaller after inserting a new element. The root cause is that HyperLogLog++ is switching between different cardinality estimation approaches and the transitions are not always smooth. In addition, empirical data is interpolated to correct the bias, which could also lead to the observed behavior. In order to ensure monotonicity you probably need to use a different cardinality estimation algorithm that does not rely on empirical data and that uses only a single estimation approach for the full cardinality range. I am currently working on a paper that presents such new cardinality estimation algorithms, see https://github.com/oertl/hyperloglog-sketch-estimation-paper.

mtunique commented 7 years ago

Thanks all @oertl @abramsm .

abramsm commented 7 years ago

Otmar -

That paper looks very interesting. It will take me a while to truly digest it. It would be great if you can contribute an implementation of your new algorithm to stream-lib.

Matt

On Thu, Dec 29, 2016 at 2:22 PM, Otmar Ertl notifications@github.com wrote:

The current implementation does not guarantee monotonicity. The cardinality estimate might become smaller after inserting a new element. The root cause is that HyperLogLog++ is switching between different cardinality estimation approaches and the transitions are not always smooth. In addition, empirical data is interpolated to correct the bias, which could also lead to the observed behavior. In order to ensure monotonicity you probably need to use a different cardinality estimation algorithm that does not rely on empirical data and that uses only a single estimation approach for the full cardinality range. I am currently working on a paper that presents such new cardinality estimation algorithms, see https://github.com/oertl/hyperloglog-sketch-estimation-paper.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/addthis/stream-lib/issues/122#issuecomment-269695439, or mute the thread https://github.com/notifications/unsubscribe-auth/AABfax6a3HwJboaif1sh3Mzvme0IXvusks5rNCR_gaJpZM4LVwfj .

oertl commented 7 years ago

Matt,

I would like to, if I had more spare time. I still have to refine and finish my paper first.

Otmar