addthis / stream-lib

Stream summarizer and cardinality estimator.
Apache License 2.0
2.26k stars 558 forks source link

HLL++ offer() always returns true when format SPARSE #91

Open xkrt opened 9 years ago

xkrt commented 9 years ago

Hi,

HyperLogLogPlus.offerHashed() constantly returns true in SPARSE format, even if hash already counted.

https://github.com/addthis/stream-lib/blob/master/src/main/java/com/clearspring/analytics/stream/cardinality/HyperLogLogPlus.java#L309

Test:

@Test
public void testOffer() {
    HyperLogLogPlus hll = new HyperLogLogPlus(5,25);
    assertTrue(hll.offer("ABC"));
    assertFalse(hll.offer("ABC")); // boom!
}
abramsm commented 9 years ago

Confirmed that it returns true. This behavior doesn't actually result in incorrect cardinality results but it does violate the interface's contract.

The issue is that for efficiency purposes we do not merge the temp set until it reaches a certain size or someone tries to get the cardinality represented by the set.

In order to correctly return 'false' on each entry we would have to merge the temp set after each offer and this would slow things down quite a bit.

What are peoples thoughts on optimal solution here? We should at least add some javadoc to explain why this is and consider changing the interface contract.

Matt

On Thu, Jul 2, 2015 at 4:32 AM Pavel Martynov notifications@github.com wrote:

Hi,

HyperLogLogPlus.offerHashed() constantly returns true in SPARSE format, even if hash already counted.

https://github.com/addthis/stream-lib/blob/master/src/main/java/com/clearspring/analytics/stream/cardinality/HyperLogLogPlus.java#L309

Test:

@Testpublic void testOffer() { HyperLogLogPlus hll = new HyperLogLogPlus(5,25); assertTrue(hll.offer("ABC")); assertFalse(hll.offer("ABC")); // boom! }

— Reply to this email directly or view it on GitHub https://github.com/addthis/stream-lib/issues/91.

xkrt commented 9 years ago

IMHO you should measure how much will slow down contract support and if it really quite a bit - fix offer() to return correct value.

I want to explain why I need it: my business logic should perfom some notification when counter becomes larger than some predefined value. Pseudocode:

if (hll.offer(value) == true) {
    if (hll.cardinality() > notifyBound) {
        notify();
    }
}

If offer() will always return true (in Sparse format) I will be forced always call cardinality() that expensive as far I understand. Moreover stream of all values is very big but new values comes to counter relatively rarely.

And I think contract violation with explanation in docs in general bad idea leading to poor API.

Before stream-lib I used Redis (http://redis.io/commands/pfadd).

abramsm commented 9 years ago

I've created a branch that provides the correct behavior. Basic performance tests seem OK but more testing on that front is required before merging into master. I've also updated language to Java 8 because, why not...

Are you interested in helping with performance analysis?

https://github.com/addthis/stream-lib/commit/f4c88af1061b5c1b5c15e4aa81dfd6c1f746780f

Matt

On Fri, Jul 3, 2015 at 12:36 AM Pavel Martynov notifications@github.com wrote:

IMHO you should measure how much will slow down contract support and if it really quite a bit - fix offer() to return correct value.

I want to explain why I need it: my business logic should perfom some notification when counter becomes larger than some predefined value. Pseudocode:

if (hll.offer(value) == true) { if (hll.cardinality() > notifyBound) { notify(); } }

If offer() will be always return true (in Sparse format) I will be forced always call cardinality() that expensive as far I understand. Moreover stream of all values is very big but new values comes to counter relatively rarely.

And I think contract violation with explanation in docs in general bad idea leading to poor API.

Before stream-lib I used Redis (http://redis.io/commands/pfadd).

— Reply to this email directly or view it on GitHub https://github.com/addthis/stream-lib/issues/91#issuecomment-118251067.

xkrt commented 9 years ago

@abramsm I will try to compare perf on my dataset on next week. thanks!

xkrt commented 9 years ago

@abramsm I run fixed offer() on one of my data sets (~10e6 items) and performance just sliiiightly changed. So I think your fix is production ready and may be merged.

abramsm commented 9 years ago

Thanks. Can you share any more details and results of your performance tests? A few folks have expressed concerns about this approach so we will need to give them time to propose/implement alternatives.

Matt

On Mon, Jul 6, 2015, 1:50 AM Pavel Martynov notifications@github.com wrote:

@abramsm https://github.com/abramsm I run fixed offer() on one of my data sets (~10e6 items) and performance just sliiiightly changed. So I think your fix is production ready and may be merged.

— Reply to this email directly or view it on GitHub https://github.com/addthis/stream-lib/issues/91#issuecomment-118761855.

mspiegel commented 9 years ago

I have two issues with the proposed changes: (1) I don't see the purpose of keeping around the tmpSet and the sparseSet if we introduce the transientSparseSet. (2) Presumably the tmpSet and the sparseSet were introduced because of some combination of (a) lower memory footprint and/or (b) amortized cost of insertions into those data structures is cheaper than using a hash set. Maybe we could use a specialized hash table for int primitives for the transientSparseSet and eliminate the tmpSet and the sparseSet. I'm not certain. It's likely that @tea-dragon has some objections too.

abramsm commented 9 years ago

Yeah, those are good points and I think why we didn't do this initially.

Pavel - can you try your performance tests again using the current release? And modify your code to always call 'cardinality' rather than waiting for a 'true' response from offer. Lets see what the performance looks like with that as compared to the hack in this branch.

Matt

On Mon, Jul 6, 2015 at 8:37 AM mspiegel notifications@github.com wrote:

I have two issues with the proposed changes: (1) I don't see the purpose of keeping around the tmpSet and the sparseSet if we introduce the transientSparseSet. (2) Presumably the tmpSet and the sparseSet were introduced because of some combination of (a) lower memory footprint and/or (b) amortized cost of insertions into those data structures is cheaper than using a hash set. Maybe we could use a specialized hash table for int primitives for the transientSparseSet and eliminate the tmpSet and the sparseSet. I'm not certain. It's likely that @tea-dragon https://github.com/tea-dragon has some objections too.

— Reply to this email directly or view it on GitHub https://github.com/addthis/stream-lib/issues/91#issuecomment-118874890.

xkrt commented 9 years ago

While I am experementing with performance of fixed offer() I found a problem:

@Test
public void testOfferReturn() {
    HyperLogLogPlus hll = new HyperLogLogPlus(5, 25);
    int uniqOffers = 10000;
    int hllUpdates = 0;
    for (int i = 0; i < uniqOffers; ++i) {
        if (hll.offer(UUID.randomUUID().toString()))
            ++hllUpdates;
    }
    assertEquals(uniqOffers, hllUpdates);
}

Test fails with:

Expected :10000
Actual   :166

Test run on f4c88af1061b5c1b5c15e4aa81dfd6c1f746780f

abramsm commented 9 years ago

This has to do with the precision properties of the counter. In normal mode the buckets do not update every time a new hash is offered to the set. The higher the precision the more likely a bucket is to update. If you change your test from:

HyperLogLogPlus(5,25)

to

HyperLogLogPlus(25,25)

You will get a result like:

Expected :10000 Actual :9998

Which is about as close as you can get.

Give this fact I think the proper course of action is to update the interface documentation rather than changing the code. Michael Spiegel proposed something like:

"returns true if it alters the internal state of the of the cardinality estimator"

Matt

On Wed, Jul 8, 2015 at 2:23 AM Pavel Martynov notifications@github.com wrote:

While I am experementing with performance of fixed offer() I found a problem:

@Testpublic void testOfferReturn() { HyperLogLogPlus hll = new HyperLogLogPlus(5, 25); int uniqOffers = 10000; int hllUpdates = 0; for (int i = 0; i < uniqOffers; ++i) { if (hll.offer(UUID.randomUUID().toString())) ++hllUpdates; } assertEquals(uniqOffers, hllUpdates); }

Test fails with:

Expected :10000 Actual :166

Test run on f4c88af https://github.com/addthis/stream-lib/commit/f4c88af1061b5c1b5c15e4aa81dfd6c1f746780f

— Reply to this email directly or view it on GitHub https://github.com/addthis/stream-lib/issues/91#issuecomment-119490320.

xkrt commented 9 years ago

@abramsm auh, I see, you are right. I have some additional questions about your HLL++ implementation, will ask it in mail list. Thanks!