ascv / HyperLogLog

Fast HyperLogLog for Python.

Biased estimation for k>13 #15

Closed ngrislain closed 9 years ago

ngrislain commented 9 years ago

I compiled and installed your library, but the result seems biased when k > 13. I tried different seeds and the estimator consistently returns about 1.4x the real count...

Do you have the same issue ?

Best,

ascv commented 9 years ago

What is the real count you are using to test with?

ngrislain commented 9 years ago

It's the exact cardinality of the set of words in a book.


ascv commented 9 years ago

What is the exact cardinality of the words in the book?

ngrislain commented 9 years ago

I compared len(words) and hll.cardinality(), where words is a set() to which I added every word while also adding each one with hll.add().


ascv commented 9 years ago

I need to know the exact cardinality you are using to test with in order to look into this issue. Please tell me the exact cardinality.

ascv commented 9 years ago

To phrase it a different way, in case I'm being unclear, what is the cardinality when you use set([]) and len(words)? 1,000? 10,000? 100,000?

ngrislain commented 9 years ago

Here is the code:

from urllib import urlopen
import numpy as np
import matplotlib.pyplot as pp
from HLL import HyperLogLog

def word_generator():
    i = 0
    # alternatively, read a local copy: urlopen('local/path/to/pg10.txt')
    for line in urlopen('http://www.gutenberg.org/cache/epub/10/pg10.txt'):
        for word in line.split():
            yield word
            i += 1
            if i > 10000000:
                print "end"
                return

# Exact evaluation
words = set([])
for word in word_generator():
    words.add(word)

print len(words)

result = []
for i in xrange(5):
    hll = HyperLogLog(14, seed=i)
    for word in word_generator():
        hll.add(word)
    print i
    result.append(hll.cardinality() / float(len(words)))

pp.plot(result)
pp.show()


ascv commented 9 years ago

Thanks, I'll take a look and see what is going on.

ngrislain commented 9 years ago

It is ~34000


ascv commented 9 years ago

What is happening is that the small range correction in the algorithm is being triggered for k > 13, so it switches to a linear count to estimate the cardinality.

When does this happen? It happens when the cardinality isn't sufficiently large for the number of registers you are using, which causes many registers to stay 0. The algorithm notices this and switches to a linear count. In particular, the algorithm switches when the raw estimate is <= 2.5 * 2^k (that is, 2.5 * m, where m = 2^k is the number of registers).

Why does the algorithm need to switch to a linear count? Because those zero registers significantly inflate the cardinality estimate, since 2^0 = 1 and the algorithm computes the sum of 2^(-M[j]), where M is the array of registers and j is an index. In fact the linear count is generally going to be better than the raw HyperLogLog estimate in this case.
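
For reference, here is a minimal sketch of that decision in Python, following the estimator from the Flajolet et al. HyperLogLog paper; it is not this library's actual implementation, and the names are only illustrative:

from math import log

def estimate_cardinality(M):
    """Sketch of the HyperLogLog estimate with the small range correction.

    M is the list of register values, with m = len(M) = 2^k.
    """
    m = len(M)
    alpha = 0.7213 / (1.0 + 1.079 / m)                # bias constant (valid for m >= 128)
    raw = alpha * m * m / sum(2.0 ** -x for x in M)   # raw HyperLogLog estimate

    V = M.count(0)                                    # number of registers still at zero
    if raw <= 2.5 * m and V > 0:                      # small range correction condition
        return m * log(float(m) / V)                  # linear counting
    return raw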

Now let's look at this particular data set:

from HLL import HyperLogLog
from urllib import urlopen
import matplotlib.pyplot as pp
import numpy as np

def word_generator():
    # http://www.gutenberg.org/cache/epub/10/pg10.txt
    f = open('pg10.txt', 'r')
    for line in f:
        for word in line.split():
            yield word

words = set([])
for word in word_generator():
    words.add(word)

N = len(words)
print 'true cardinality: %s' % N

k_start = 10
k_end = 17
result = []
pct_zero = []
for i in xrange(k_start, k_end):
    hll = HyperLogLog(i)
    for word in word_generator():
        hll.add(word)
    registers = map(int, hll.registers())              # register values
    zeros = filter(lambda x: x == 0, registers)        # registers still at zero
    pct_zero.append(1.0*len(zeros)/len(registers))      # fraction of zero registers
    result.append(hll.cardinality()/N)                  # estimate relative to true count

k = range(k_start, k_end)
pp.plot(k, result, label='estimate/true cardinality')
pp.plot(k, pct_zero, label='% of registers that are 0', color='red')
pp.legend(loc='best')
pp.title('Demonstration of small range correction')
pp.xlabel('k')
pp.ylabel('%')
pp.grid()
pp.show()

You can see in the graph that the percentage of registers that are zero goes up significantly when k > 13, which corresponds to where the algorithm switches to a linear count.
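
The numbers for this data set line up with the graph. With n items spread over m = 2^k registers, the expected fraction of zero registers is roughly e^(-n/m), and the correction kicks in once the raw estimate falls below 2.5 * m. A quick back-of-the-envelope check (rough approximations, not output from the library):

from math import exp

N = 34000                                      # approximate true cardinality of the word set
for k in range(10, 17):
    m = 2 ** k                                 # number of registers
    zero_frac = exp(-float(N) / m)             # expected fraction of zero registers
    # the correction actually tests the raw estimate, which is close to N at these sizes
    triggered = N <= 2.5 * m
    print('k=%2d  zero registers ~%4.1f%%  correction: %s' % (k, 100 * zero_frac, triggered))

At k = 13 only about 1.6% of the registers stay zero and the 2.5 * m threshold (20,480) is below the true count, so no correction; at k = 14 roughly 13% stay zero and the threshold (40,960) is above it, so the linear count takes over.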

How do I pick a suitable k? You need to test various k on a sample of the data set and, if possible, leverage any heuristics you have to bound the cardinality. As a rule of thumb you probably want less than 1-2% of your registers to be zero.
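
If you can put a rough lower bound on the cardinality you expect to see, that rule of thumb can be applied programmatically. A sketch using the same e^(-n/m) approximation as above (largest_k and its parameters are illustrative, not part of this library):

from math import exp

def largest_k(min_expected_cardinality, max_zero_fraction=0.02, k_max=16):
    """Largest k whose expected fraction of zero registers stays under the cutoff."""
    best = None
    for k in range(2, k_max + 1):
        m = 2 ** k
        if exp(-float(min_expected_cardinality) / m) <= max_zero_fraction:
            best = k          # bigger k means more registers, hence better accuracy
    return best

print(largest_k(34000))       # -> 13 for this data set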

I might update the documentation with some discussion of this and of how to pick a suitable k.

ngrislain commented 9 years ago

Thank you for this very clear explanation.