dnmilne / wikipediaminer

An open source toolkit for mining Wikipedia

Lots of bloom filter false positives in labelOccurrence extraction step #14

Open dnmilne opened 10 years ago

dnmilne commented 10 years ago

We use a Bloom filter to decide which n-grams to count during the label occurrence extraction step, and it seems to be producing a very large number of false positives: on the Simple English Wikipedia we get 2M misses, where we would expect only a few thousand. This has a big effect on how much work the combiners and reducers have to do.
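
For reference, here is a rough sketch of the standard Bloom filter estimate, which may help check whether the filter is simply undersized for the label vocabulary. It is not taken from the Wikipedia Miner code. The function names are hypothetical and the figures in the usage lines are made-up placeholders, not measurements from this job.

```python
import math


def false_positive_rate(n_items, m_bits, k_hashes):
    """Standard approximation (1 - e^(-k*n/m))^k for a Bloom filter
    holding n_items in m_bits using k_hashes hash functions."""
    return (1.0 - math.exp(-k_hashes * n_items / m_bits)) ** k_hashes


def optimal_bits(n_items, target_rate):
    """Bits needed to hold n_items at target_rate, assuming the
    optimal number of hash functions is used."""
    return math.ceil(-n_items * math.log(target_rate) / (math.log(2) ** 2))


def optimal_hashes(n_items, m_bits):
    """Number of hash functions that minimises the false-positive rate."""
    return max(1, round(m_bits / n_items * math.log(2)))


def expected_false_positives(n_probes, rate):
    """Expected number of non-member probes that wrongly pass the filter."""
    return n_probes * rate


# Placeholder figures only: 1,000 stored labels, 10,000 bits, 7 hashes.
rate = false_positive_rate(1_000, 10_000, 7)
print(f"estimated false-positive rate: {rate:.4f}")
print(f"bits needed for a 1% rate: {optimal_bits(1_000, 0.01)}")
print(f"hashes for that size: {optimal_hashes(1_000, optimal_bits(1_000, 0.01))}")
```

If the rate estimated from the real filter parameters is far below what we observe, the problem is more likely in the hashing or in how the filter is populated than in its size.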

It also doesn't look good for my proposed strategy for Issue #9.