dnmilne / wikipediaminer

An open source toolkit for mining Wikipedia

Rethink page link caching #12

Open dnmilne opened 10 years ago

dnmilne commented 10 years ago

Caching of page links is only necessary for comparison (and by extension, wikification and suggestion). We could probably rely on db lookups if these relatedness measures were precalculated, so one db lookup per pair rather than two lookups plus a calculation. Although there is a crazy number of article pairs, only a small proportion of them have any chance of being related, and there is probably a long tail of barely-related pairs that could be safely ignored.
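For context, the link-based relatedness used for comparison here (the Milne-Witten measure) needs only the two articles' in-link sets and the total article count, so a precalculation pass with long-tail pruning could look roughly like the sketch below. The class and the `RELATEDNESS_FLOOR` threshold are hypothetical, not part of the toolkit:

```java
import java.util.Set;

/** Sketch of precomputing link-based relatedness and pruning the long tail.
 *  RelatednessPrecomputer and RELATEDNESS_FLOOR are hypothetical names. */
public class RelatednessPrecomputer {

    // Pairs scoring below this floor are dropped rather than stored.
    private static final double RELATEDNESS_FLOOR = 0.1;

    /** Milne-Witten relatedness from the two articles' in-link sets,
     *  where totalArticles is the number of articles in the dump. */
    static double relatedness(Set<Integer> inLinksA, Set<Integer> inLinksB,
                              long totalArticles) {
        // iterate over the smaller set when counting the intersection
        Set<Integer> small = inLinksA.size() <= inLinksB.size() ? inLinksA : inLinksB;
        Set<Integer> large = small == inLinksA ? inLinksB : inLinksA;
        int intersection = 0;
        for (int id : small)
            if (large.contains(id)) intersection++;

        if (intersection == 0) return 0;

        double larger  = Math.max(inLinksA.size(), inLinksB.size());
        double smaller = Math.min(inLinksA.size(), inLinksB.size());
        double distance = (Math.log(larger) - Math.log(intersection))
                        / (Math.log(totalArticles) - Math.log(smaller));
        return Math.max(0.0, 1.0 - distance);
    }

    /** Returns the score to store, or NaN if the pair falls in the
     *  barely-related long tail and can be skipped. */
    static double scoreOrSkip(Set<Integer> a, Set<Integer> b, long total) {
        double r = relatedness(a, b, total);
        return r >= RELATEDNESS_FLOOR ? r : Double.NaN;
    }
}
```

Storing only pairs at or above the floor is what keeps the table size manageable despite the huge number of possible pairs.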

apohllo commented 10 years ago

Hi @dnmilne, I considered various options in my implementation of part of Wikipedia Miner in Ruby (not publicly available yet), and it seems the best solution is a hybrid one: pre-calculate relatedness measures for popular article pairs and calculate just-in-time for the others. I also have my own DB implementation for storing the data (based partly on Berkeley DB), called Ruby Object Database, with a special routine designed exactly for computing article link intersections. This solution seems to work quite nicely.
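A rough sketch of what that hybrid lookup might look like (in Java rather than Ruby, with `PrecomputedStore` and `computeOnline` as hypothetical stand-ins for the actual storage and intersection routines):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Sketch of the hybrid strategy: serve precomputed scores for popular
 *  pairs, fall back to just-in-time computation for the rest. */
public class HybridRelatedness {

    interface PrecomputedStore {
        // returns null when the pair was not precomputed
        Double lookup(int articleA, int articleB);
    }

    private final PrecomputedStore store;
    private final Map<Long, Double> jitCache = new ConcurrentHashMap<>();

    HybridRelatedness(PrecomputedStore store) { this.store = store; }

    double relatedness(int a, int b) {
        // normalise the ordering so (a,b) and (b,a) share one key
        int lo = Math.min(a, b), hi = Math.max(a, b);
        Double stored = store.lookup(lo, hi);
        if (stored != null) return stored;       // popular pair: one DB hit

        // pack the pair into a long key (assumes non-negative article ids)
        long key = ((long) lo << 32) | hi;
        return jitCache.computeIfAbsent(key,
                k -> computeOnline(lo, hi));     // unpopular pair: compute once
    }

    private double computeOnline(int a, int b) {
        // placeholder for the on-the-fly link-intersection computation
        return 0.0;
    }
}
```

Normalising the pair ordering means each unordered pair is stored and memoised only once.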

Neuw84 commented 10 years ago

Hi @apohllo, how do you choose popular article pairs?

apohllo commented 10 years ago

Well, these are articles with many incoming links - they have a high chance of appearing in the processed texts, e.g. Washington, United States, etc. What is more, since they have many incoming links, computing their link intersections is the slowest part. As a result the performance boost is even higher.
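Selecting those articles then reduces to ranking by in-link count and precomputing only the pairs among the top N. A sketch, where `Article` and the method names are hypothetical placeholders rather than the toolkit's API:

```java
import java.util.Comparator;
import java.util.List;

/** Sketch of choosing the "popular" articles whose pairs get precomputed. */
public class PopularPairSelector {

    record Article(int id, int inLinkCount) {}

    /** Keep the N articles with the most incoming links. */
    static List<Article> popular(List<Article> all, int n) {
        return all.stream()
                  .sorted(Comparator.comparingInt(Article::inLinkCount).reversed())
                  .limit(n)
                  .toList();
    }

    /** Precompute only pairs where both articles are popular; these are
     *  also exactly the pairs whose intersections are slowest to compute. */
    static void precomputeAll(List<Article> popular) {
        for (int i = 0; i < popular.size(); i++)
            for (int j = i + 1; j < popular.size(); j++) {
                // store relatedness(popular.get(i), popular.get(j)) in the DB
            }
    }
}
```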

dnmilne commented 10 years ago

Hi @apohllo, that makes a lot of sense. I'll probably adopt a similar approach, depending on how I go with exhaustively calculating relatedness at extraction time.