Closed ngrislain closed 9 years ago
What is the real count you are using to test with?
It's the exact cardinality of the set of words in a book.
2015-04-03 18:18 GMT+02:00 Joshua Andersen notifications@github.com:
What is the real count you are using to test with?
— Reply to this email directly or view it on GitHub https://github.com/ascv/HyperLogLog/issues/15#issuecomment-89341352.
What is the exact cardinality of the words in the book?
I compared : len(words) and hll.cardinality() where words is a set([]) where I added all the words while adding them with hll.add()
2015-04-03 18:21 GMT+02:00 Joshua Andersen notifications@github.com:
What is the exact cardinality?
— Reply to this email directly or view it on GitHub https://github.com/ascv/HyperLogLog/issues/15#issuecomment-89342303.
I need to know the exact cardinality you are using to test with in order to look into this issue. Please tell me the exact cardinality.
To phrase it a different way, in case I'm being unclear, what is the cardinality when you use set([]) and len(words)? 1,000? 10,000? 100,000?
Here is the code :
from urllib import urlopen import numpy as np import matplotlib.pyplot as pp from HLL import HyperLogLog
def word_generator(): i = 0 for line in urlopen('http://www.gutenberg.org/cache/epub/10/pg10.txt'):
for word in line.split():
yield word
i+=1
if i>10000000:
print "end"
return
words = set([]) for word in word_generator(): words.add(word)
print len(words)
result = [] for i in xrange(5): hll = HyperLogLog(14, seed=i) for word in word_generator(): hll.add(word) print i result.append(hll.cardinality()/float(len(words)))
pp.plot(result) pp.show()
2015-04-03 18:24 GMT+02:00 Joshua Andersen notifications@github.com:
I need to know the exact cardinality you are using to test with in order to look into this issue. Please tell me the test cardinality.
— Reply to this email directly or view it on GitHub https://github.com/ascv/HyperLogLog/issues/15#issuecomment-89343208.
Thanks, I'll take a look and see what is going on.
It is ~34000
2015-04-03 18:31 GMT+02:00 Joshua Andersen notifications@github.com:
To phrase it a different way, in case I'm being unclear, what is the cardinality when you use set([]) and len(words)? 1,000? 10,000? 100,000?
— Reply to this email directly or view it on GitHub https://github.com/ascv/HyperLogLog/issues/15#issuecomment-89344892.
What is happening is that the small range correction in the algorithm is being triggered for k > 13 so it's switching to a linear count to estimate the cardinality.
When does this happen? This happens when the cardinality isn't sufficiently large for the number of registers you are using. This causes many registers to stay 0. When this happens the algorithm notices it and switches to a linear count. In particular, the algorithm switches when the raw estimate <= 2.5 * 2^k.
Why does the algorithm need to switch to a linear count? Because those zero registers significantly inflate the cardinality estimate since 2^0=1 and the algorithm computes the sum 2^M[j] where M is the array of registers and j is an index. In fact the linear count is generally going to be better than the raw HyperLogLog estimate in this case.
Now let's look at this particular data set:
from HLL import HyperLogLog
from urllib import urlopen
import matplotlib.pyplot as pp
import numpy as np
def word_generator():
# http://www.gutenberg.org/cache/epub/10/pg10.txt
f = open('pg10.txt', 'r')
for line in f:
for word in line.split():
yield word
words = set([])
for word in word_generator():
words.add(word)
N = len(words)
print 'true cardinality: %s' % N
k_start = 10
k_end = 17
result = []
pct_zero = []
for i in xrange(k_start, k_end):
hll = HyperLogLog(i)
for word in word_generator():
hll.add(word)
registers = map(int, hll.registers())
zeros = filter(lambda x: x == 0, registers)
pct_zero.append(1.0*len(zeros)/len(registers))
result.append(hll.cardinality()/N)
k = range(k_start, k_end)
pp.plot(k, result, label='estimate/true cardinality')
pp.plot(k, pct_zero, label='% of registers that are 0', color='red')
pp.legend(loc='best')
pp.title('Demonstation of small range correction')
pp.xlabel('k')
pp.ylabel('%')
pp.grid()
pp.show()
You can see in the graph that the % of registers that are zero goes up significantly when k > 13 which corresponds to where the algorithm switches to a linear count.
How do I pick a suitable k? You need to test various k on a sample sof the data set and if possible leverage any heuristics to bound the cardinality. As a rule of thumb you probably want less than 1-2% of your registers to be zero.
I might update the documentation with some discussion about this and picking a suitable k.
Thank you for this very clear explanation
I compiled and installed your library, and the result seems biased when k>13 I tried with different seeds and the estimator consistently returns 1.4 x the real count...
Do you have the same issue ?
Best,