apache / datasketches-cpp

Core C++ Sketch Library
https://datasketches.apache.org
Apache License 2.0
225 stars 71 forks source link

theta_a_not_b misbehaving on ordered sketches #152

Closed Lordshinjo closed 4 years ago

Lordshinjo commented 4 years ago

When called on 2 ordered sketches, it seems that a_not_b is severely overestimating its result in some cases, with a_not_b(a, b) having an estimate bigger than a if b is bigger than a.

Here is a small reproduction (on the current master):

>>> import random
>>> import uuid
>>> import datasketches
>>> ids = [str(uuid.uuid4()) for _ in range(100_000)]
>>> a_ids = random.sample(ids, 10_000)
>>> b_ids = random.sample(ids, 25_000)
>>> len(set(a_ids) - set(b_ids))
7548
>>> a = datasketches.update_theta_sketch()
>>> b = datasketches.update_theta_sketch()
>>> for x in a_ids:
...     a.update(x)
...
>>> for x in b_ids:
...     b.update(x)
...
>>> a_compact = a.compact()
>>> b_compact = b.compact()
>>> print(a)
### Update Theta sketch summary:
   lg nominal size      : 12
   lg current size      : 13
   num retained keys    : 5333
   resize factor        : 8
   sampling probability : 1
   seed hash            : 37836
   empty?               : false
   ordered?             : false
   estimation mode?     : true
   theta (fraction)     : 0.533299
   theta (raw 64-bit)   : 4918811584238243237
   estimate             : 10000
   lower bound 95% conf : 9813.24
   upper bound 95% conf : 10190.3
### End sketch summary

>>> print(b)
### Update Theta sketch summary:
   lg nominal size      : 12
   lg current size      : 13
   num retained keys    : 7074
   resize factor        : 8
   sampling probability : 1
   seed hash            : 37836
   empty?               : false
   ordered?             : false
   estimation mode?     : true
   theta (fraction)     : 0.28443
   theta (raw 64-bit)   : 2623405683183212657
   estimate             : 24870.8
   lower bound 95% conf : 24373.3
   upper bound 95% conf : 25378.4
### End sketch summary

>>> for a_ in (a, a_compact):
...     for b_ in (b, b_compact):
...             print(datasketches.theta_a_not_b().compute(a_, b_))
...
### Compact Theta sketch summary:
   num retained keys    : 2178
   seed hash            : 37836
   empty?               : false
   ordered?             : true
   estimation mode?     : true
   theta (fraction)     : 0.28443
   theta (raw 64-bit)   : 2623405683183212657
   estimate             : 7657.41
   lower bound 95% conf : 7382.58
   upper bound 95% conf : 7942.37
### End sketch summary

### Compact Theta sketch summary:
   num retained keys    : 2178
   seed hash            : 37836
   empty?               : false
   ordered?             : true
   estimation mode?     : true
   theta (fraction)     : 0.28443
   theta (raw 64-bit)   : 2623405683183212657
   estimate             : 7657.41
   lower bound 95% conf : 7382.58
   upper bound 95% conf : 7942.37
### End sketch summary

### Compact Theta sketch summary:
   num retained keys    : 2178
   seed hash            : 37836
   empty?               : false
   ordered?             : true
   estimation mode?     : true
   theta (fraction)     : 0.28443
   theta (raw 64-bit)   : 2623405683183212657
   estimate             : 7657.41
   lower bound 95% conf : 7382.58
   upper bound 95% conf : 7942.37
### End sketch summary

### Compact Theta sketch summary:
   num retained keys    : 4635
   seed hash            : 37836
   empty?               : false
   ordered?             : true
   estimation mode?     : true
   theta (fraction)     : 0.28443
   theta (raw 64-bit)   : 2623405683183212657
   estimate             : 16295.7
   lower bound 95% conf : 15893.5
   upper bound 95% conf : 16708
### End sketch summary

In the 3 first cases, a_not_b is called with either 1 or 2 non ordered sketches, and the result is correct. In the last case it is called with 2 ordered sketches, and the estimate (16295.7) is higher than the estimate of a (10000).

(When calling it on unordered compact sketches, or when a > b, the results are correct)

AlexanderSaydakov commented 4 years ago

Thanks for finding this and for a clear description. I reproduced this and will come up with a fix shortly.