We are using a Bloom filter to decide which n-grams to count during the label occurrence extraction step. We seem to be getting a very large number of false positives: on the Simple English Wikipedia we get 2M misses, where we expect only a few thousand. This greatly increases the amount of work the combiners and reducers have to do.
It also doesn't look good for my proposed strategy for Issue #9.
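
To check whether the filter is simply undersized, it may help to compare the observed rate against the standard theoretical estimate for a Bloom filter, p ≈ (1 - e^(-kn/m))^k, where n is the number of inserted keys, m the number of bits and k the number of hash functions. The sketch below is only an illustration: the function names and the numbers in the usage note are hypothetical and are not taken from our code or configuration.

```python
import math


def bloom_fp_rate(n_items: int, m_bits: int, k_hashes: int) -> float:
    """Theoretical false positive rate, assuming independent, uniform hashes."""
    return (1.0 - math.exp(-k_hashes * n_items / m_bits)) ** k_hashes


def bloom_size_for(n_items: int, target_fp: float) -> tuple[int, int]:
    """Bit count and hash count that approximately achieve target_fp."""
    # Standard sizing formulas: m = -n ln(p) / (ln 2)^2, k = (m / n) ln 2.
    m_bits = math.ceil(-n_items * math.log(target_fp) / (math.log(2) ** 2))
    k_hashes = max(1, round(m_bits / n_items * math.log(2)))
    return m_bits, k_hashes


def expected_false_positives(n_probes: int, fp_rate: float) -> float:
    """Expected number of non-member lookups that the filter lets through."""
    return n_probes * fp_rate
```

For example, with a hypothetical 1,000,000 labels in a 10,000,000-bit filter using 7 hash functions, `bloom_fp_rate(1_000_000, 10_000_000, 7)` comes out at roughly 0.8%. If the rate we observe is far above the estimate for our actual parameters, the cause is more likely the hash functions or the way the filter is built than its size.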