dnmilne / wikipediaminer

An open source toolkit for mining Wikipedia

Rethink label caching #9

Open dnmilne opened 10 years ago

dnmilne commented 10 years ago

Caching of labels is only necessary for wikification, and in this situation there are far more misses than hits, because we check every n-gram in the document and most of these are nonsense phrases. A Bloom filter would quickly reject nearly all of the misses (it never rejects a real label, and only a small fraction of non-labels slip through as false positives), and looking up the hits would probably be fast enough via the database.
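To make the idea concrete, here is a minimal sketch of the lookup path, written in Python for brevity. It is not Wikipedia Miner code: `BloomFilter`, `ngrams`, `find_labels`, and the `max_len` parameter are hypothetical names, and a plain dict stands in for the label database. The filter is built once from all known labels; during wikification each n-gram is tested against it first, and only the n-grams that pass go on to the database.

```python
import hashlib
import math


class BloomFilter:
    """Fixed-size Bloom filter: no false negatives, tunable false-positive rate."""

    def __init__(self, expected_items, false_positive_rate=0.01):
        # Standard sizing: m = -n ln(p) / (ln 2)^2 bits, k = (m / n) ln 2 hashes.
        n = max(1, expected_items)
        self.num_bits = max(
            8, math.ceil(-n * math.log(false_positive_rate) / math.log(2) ** 2)
        )
        self.num_hashes = max(1, round(self.num_bits / n * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, text):
        # Double hashing: derive k bit positions from two halves of one digest.
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, text):
        for pos in self._positions(text):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, text):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(text)
        )


def ngrams(tokens, max_len):
    """Yield every run of 1..max_len consecutive tokens as a phrase."""
    for start in range(len(tokens)):
        for end in range(start + 1, min(start + max_len, len(tokens)) + 1):
            yield " ".join(tokens[start:end])


def find_labels(tokens, label_filter, label_db, max_len=3):
    """Return (labels found, number of database lookups actually performed)."""
    found = {}
    db_lookups = 0
    for phrase in ngrams(tokens, max_len):
        if phrase not in label_filter:
            continue  # definite miss: skip the database entirely
        db_lookups += 1  # hit or false positive: the database decides
        entry = label_db.get(phrase)  # dict stands in for the real database
        if entry is not None:
            found[phrase] = entry
    return found, db_lookups
```

A false positive costs one wasted database lookup but never a wrong result, because the database still has the final say. The 1% false-positive rate is an arbitrary default for the sketch; the right value depends on how expensive a database miss is compared with the memory the filter takes.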