Open dnmilne opened 10 years ago
Hi @dnmilne, I considered various options in my implementation of part of Wikipedia Miner in Ruby (not publicly available yet). It seems the best solution is a hybrid one: pre-calculate relatedness measures for popular article pairs and calculate just-in-time for the rest. I also have my own database implementation for storing the data (based partly on Berkeley DB), called Ruby Object Database, with a routine designed specifically for computing article link intersections. This solution seems to work quite nicely.
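For anyone curious, the hybrid idea could be sketched roughly like this (names and structure are illustrative, not the actual Ruby Object Database code; I'm assuming the Milne-Witten inlink-based measure for the just-in-time path):

```ruby
require 'set'

# Hypothetical sketch: relatedness for popular article pairs is
# precomputed into a table, everything else is computed just-in-time
# from the articles' inlink sets and memoized for later reuse.
class RelatednessCache
  def initialize(inlinks, total_articles, precomputed = {})
    @inlinks = inlinks                 # { article => Set of inlinking articles }
    @total   = total_articles.to_f     # total number of articles in the wiki
    @cache   = precomputed             # { [a, b].sort => score }
  end

  def relatedness(a, b)
    key = [a, b].sort                  # relatedness is symmetric
    @cache[key] ||= compute(a, b)      # precomputed hit, or compute and memoize
  end

  private

  # Milne-Witten relatedness based on the inlink intersection
  def compute(a, b)
    la, lb = @inlinks[a], @inlinks[b]
    inter = (la & lb).size
    return 0.0 if inter.zero?
    num = Math.log([la.size, lb.size].max) - Math.log(inter)
    den = Math.log(@total) - Math.log([la.size, lb.size].min)
    [1.0 - num / den, 0.0].max
  end
end
```

The point of the `[a, b].sort` key is that each unordered pair is stored once, whether it arrived via precomputation or a just-in-time call.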
Hi @apohllo, how do you choose popular article pairs?
Well, these are articles with many incoming links (e.g. Washington, United States), so they have a high chance of appearing in the processed texts. What's more, since they have many incoming links, computing their link intersections is the slowest. As a result the performance boost is even greater.
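Concretely, the selection could be as simple as taking the k articles with the most inlinks and enumerating every pair among them as precomputation candidates (a hypothetical sketch, not the actual implementation):

```ruby
# Illustrative sketch: given a map of article => inlink count, pick the
# k most-linked-to articles and return all pairs among them. These are
# the pairs worth precomputing, since they are both the most likely to
# co-occur in text and the slowest to intersect.
def popular_pairs(inlink_counts, k)
  top = inlink_counts.max_by(k) { |_article, count| count }.map(&:first)
  top.combination(2).to_a
end
```

With k popular articles this yields k*(k-1)/2 pairs, so k can be tuned against available storage.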
Hi @apohllo, that makes a lot of sense. I'll probably adopt a similar approach, depending on how I go with exhaustively calculating relatedness at extraction time.
Caching of page links is only necessary for comparison (and by extension, wikification and suggestion). We could probably rely on db lookups if these relatedness measures were precalculated (so one db lookup per pair, rather than two lookups plus a calculation). Although there is a crazy number of article pairs, only a small proportion have any chance of being related, and there is probably a long tail of barely-related pairs that could be safely ignored.
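The long-tail pruning could look something like this (a hypothetical sketch assuming the Milne-Witten measure, with a plain Hash standing in for the real key-value store, and an arbitrary cutoff):

```ruby
require 'set'

# Milne-Witten relatedness from two inlink sets and the article total.
def milne_witten(la, lb, total)
  inter = (la & lb).size
  return 0.0 if inter.zero?
  num = Math.log([la.size, lb.size].max) - Math.log(inter)
  den = Math.log(total) - Math.log([la.size, lb.size].min)
  [1.0 - num / den, 0.0].max
end

# Precompute relatedness for candidate pairs, storing only scores at or
# above the cutoff so the long tail of barely-related pairs costs nothing.
def precompute(pairs, inlinks, total, cutoff: 0.3)
  pairs.each_with_object({}) do |(a, b), table|
    score = milne_witten(inlinks[a], inlinks[b], total)
    table[[a, b].sort] = score if score >= cutoff
  end
end
```

A pair absent from the table then means "unrelated (or below cutoff)", which is a single negative lookup rather than two inlink fetches and an intersection.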