Lots of Bloom filter false positives in labelOccurrence extraction step

We use a Bloom filter to decide which n-grams to count during the label occurrence extraction step, and it appears to produce a very large number of false positives: on Simple Wikipedia we get 2M misses where we would expect only a few thousand. This substantially increases the amount of work the combiners and reducers have to do.
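To judge whether the filter is simply undersized, the observed miss rate can be compared with the standard theoretical estimate for a Bloom filter with m bits, k hash functions and n inserted keys, which is roughly (1 - e^(-kn/m))^k. The sketch below is only a back-of-envelope check, not the project's code: the function names and all numbers are hypothetical and do not reflect the filter settings actually used in the extraction step.

```python
import math


def expected_false_positive_rate(num_bits, num_hashes, num_keys):
    """Standard approximation of the Bloom filter false positive rate:
    (1 - e^(-k*n/m))^k, assuming independent, uniform hash functions."""
    return (1.0 - math.exp(-num_hashes * num_keys / num_bits)) ** num_hashes


def bits_for_target_rate(num_keys, target_rate):
    """Number of bits needed to reach target_rate when the optimal
    number of hash functions is used: m = -n * ln(p) / (ln 2)^2."""
    return math.ceil(-num_keys * math.log(target_rate) / (math.log(2) ** 2))


def optimal_num_hashes(num_bits, num_keys):
    """Number of hash functions that minimises the rate: k = (m/n) * ln 2."""
    return max(1, round(num_bits / num_keys * math.log(2)))


if __name__ == "__main__":
    # Hypothetical figures for illustration only.
    num_keys = 1_000_000      # assumed number of labels inserted
    target_rate = 0.001       # assumed acceptable false positive rate

    num_bits = bits_for_target_rate(num_keys, target_rate)
    num_hashes = optimal_num_hashes(num_bits, num_keys)
    rate = expected_false_positive_rate(num_bits, num_hashes, num_keys)
    print(f"bits={num_bits} hashes={num_hashes} expected rate={rate:.6f}")

    # The same number of hashes on a filter with only one bit per key
    # saturates the bit array, so almost every lookup is a false positive.
    undersized = expected_false_positive_rate(num_keys, num_hashes, num_keys)
    print(f"undersized filter expected rate={undersized:.6f}")
```

If the observed rate is far above this estimate for the configured size, the cause is more likely to be elsewhere (for example poorly distributed hashes, or more keys being inserted than the filter was sized for) than in the sizing itself.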

It also doesn't look good for my proposed strategy for Issue #9.

dnmilne / wikipediaminer, issue #14