MichaelAquilina / Reddit-Recommender-Bot

Identifying Interesting Documents for Reddit using Recommender Techniques

Pages should be penalised for having a lot of highly weighted tf-idf terms #85

Closed: MichaelAquilina closed this issue 10 years ago

MichaelAquilina commented 10 years ago

Certain pages, such as "The Witcher 2: Assassins of Kings" and "Reflections Projections", contain a wide range of rare (singular) words. This makes them extremely likely to appear as a search result whenever one of those terms forms part of the query vector. Ideally, a page should be normalised by its total tf-idf (i.e. its complete norm) rather than by the norm of only the filtered terms that relate to the query vector.
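To illustrate the difference, here is a minimal sketch that assumes pages and queries are plain dictionaries mapping terms to tf-idf weights. The function names and the toy weights are hypothetical and are not taken from the project's code.

```python
import math


def filtered_norm_score(query, page):
    """Cosine-style score where the page norm covers only terms shared with the query."""
    shared = [t for t in query if t in page]
    dot = sum(query[t] * page[t] for t in shared)
    query_norm = math.sqrt(sum(w * w for w in query.values()))
    page_norm = math.sqrt(sum(page[t] ** 2 for t in shared))
    if query_norm == 0 or page_norm == 0:
        return 0.0
    return dot / (query_norm * page_norm)


def full_norm_score(query, page):
    """Cosine similarity where the page norm covers every term on the page."""
    dot = sum(w * page.get(t, 0.0) for t, w in query.items())
    query_norm = math.sqrt(sum(w * w for w in query.values()))
    page_norm = math.sqrt(sum(w * w for w in page.values()))
    if query_norm == 0 or page_norm == 0:
        return 0.0
    return dot / (query_norm * page_norm)


# Hypothetical toy data: one page with many rare, highly weighted terms,
# and one page that is only about the query term.
query = {"witcher": 1.0}
rare_page = {"witcher": 5.0, "geralt": 5.0, "triss": 5.0, "kaedwen": 5.0}
focused_page = {"witcher": 5.0}

for name, page in [("rare_page", rare_page), ("focused_page", focused_page)]:
    print(name, filtered_norm_score(query, page), full_norm_score(query, page))
```

With the filtered norm, both toy pages score the same, because the extra rare terms on `rare_page` never enter the denominator. With the full norm, `rare_page` is penalised for its other highly weighted terms and ranks below `focused_page`.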

MichaelAquilina commented 10 years ago

This should now be possible using the TfidfValues table. It might be worth precomputing the totals, much as Lengths are already calculated for pages.
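A rough sketch of what such a precomputation could look like. Only the table name TfidfValues comes from this issue; its columns (`page_id`, `term_id`, `tfidf`), the `TfidfNorms` table, and the function name are all assumptions. SQLite is used purely to keep the example self-contained and may not match the project's actual database.

```python
import math
import sqlite3


def precompute_tfidf_norms(conn):
    """Store the full tf-idf norm of every page in a (hypothetical) TfidfNorms table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS TfidfNorms ("
        "page_id INTEGER PRIMARY KEY, norm REAL NOT NULL)"
    )
    # Sum the squared weights per page in SQL, then take the square root in
    # Python, since sqrt() is not guaranteed to be available in SQLite.
    rows = conn.execute(
        "SELECT page_id, SUM(tfidf * tfidf) FROM TfidfValues GROUP BY page_id"
    ).fetchall()
    conn.executemany(
        "INSERT OR REPLACE INTO TfidfNorms (page_id, norm) VALUES (?, ?)",
        [(page_id, math.sqrt(total)) for page_id, total in rows],
    )
    conn.commit()


# Toy in-memory database with an assumed TfidfValues schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TfidfValues (page_id INTEGER, term_id INTEGER, tfidf REAL)")
conn.executemany(
    "INSERT INTO TfidfValues VALUES (?, ?, ?)",
    [(1, 10, 3.0), (1, 11, 4.0), (2, 10, 2.0)],
)
precompute_tfidf_norms(conn)
norms = dict(conn.execute("SELECT page_id, norm FROM TfidfNorms"))
print(norms)
```

Storing the norm once per page means a query only needs a single lookup per candidate page, rather than a scan over all of that page's terms at search time.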

MichaelAquilina commented 10 years ago

This has been implemented with the current WikiTest3 database.