MichaelAquilina / Reddit-Recommender-Bot

Identifying Interesting Documents for Reddit using Recommender Techniques

Counting the number of links in Wiki pages could improve accuracy #73

Closed MichaelAquilina closed 10 years ago

MichaelAquilina commented 10 years ago

Counting the number of incoming and outgoing links in Wikipedia pages could improve ranking performance. Using tf-idf on its own is proving to have some problems when information retrieval techniques are applied.
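A rough sketch of the counting step, assuming the links are already available as (source title, target title) pairs. The function name `count_links` and the sample data are made up for illustration and are not part of the project:

```python
from collections import Counter


def count_links(links):
    """Return (outgoing, incoming) Counters keyed by page title.

    `links` is an iterable of (source_title, target_title) pairs.
    Duplicate links between the same two pages are counted once.
    """
    outgoing = Counter()
    incoming = Counter()
    for source, target in set(links):
        outgoing[source] += 1
        incoming[target] += 1
    return outgoing, incoming


# Hypothetical sample data, not taken from the real database.
links = [
    ("Guido van Rossum", "Python (Programming Language)"),
    ("Python (Programming Language)", "Guido van Rossum"),
    ("Python (Programming Language)", "Monty Python"),
]
outgoing, incoming = count_links(links)
```

How these counts would be combined with the tf-idf score is left open here; that weighting is not decided in this issue.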

MichaelAquilina commented 10 years ago

When a link is found, check whether the target page exists. If it doesn't, create a page entry but don't actually parse it (you can't anyway). At later stages, when parsing a page, always check whether it already exists in the Pages table.
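One possible shape for this, sketched with an in-memory SQLite database standing in for MySQL so it runs on its own. The column names (`id`, `title`, `parsed`) and the helper names are assumptions, not the actual schema:

```python
import sqlite3


def get_or_create_page(conn, title):
    """Return the id for `title`, inserting an unparsed stub row if it is missing."""
    row = conn.execute("SELECT id FROM Pages WHERE title = ?", (title,)).fetchone()
    if row is not None:
        return row[0]
    cur = conn.execute("INSERT INTO Pages (title, parsed) VALUES (?, 0)", (title,))
    return cur.lastrowid


def mark_parsed(conn, title):
    """When a page is actually parsed, reuse its stub row instead of adding a duplicate."""
    page_id = get_or_create_page(conn, title)
    conn.execute("UPDATE Pages SET parsed = 1 WHERE id = ?", (page_id,))
    return page_id


# Assumed minimal Pages table; the real one will have more columns.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Pages ("
    "id INTEGER PRIMARY KEY, "
    "title TEXT UNIQUE NOT NULL, "
    "parsed INTEGER NOT NULL DEFAULT 0)"
)

stub_id = get_or_create_page(conn, "Guido van Rossum")  # seen only as a link target
same_id = mark_parsed(conn, "Guido van Rossum")         # later parsed for real
```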

MichaelAquilina commented 10 years ago

You could also treat links between candidate pages as a strong indicator that the pages are related, and therefore that they are probably the correct concepts.

For example: Guido van Rossum is the founder of Python.

Both the "Python (Programming Language)" and "Guido van Rossum" pages link to one another on Wikipedia, which is a strong indicator that the sentence is talking about these two concepts.
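A minimal sketch of that check, assuming the candidate concepts for a sentence and the known links are both held in memory. `mutual_pairs` is a hypothetical helper, and "Python (genus)" is only included as an example of a wrong candidate:

```python
from itertools import combinations


def mutual_pairs(candidates, links):
    """Return the candidate pairs whose pages link to each other in both directions."""
    link_set = set(links)
    return [
        (a, b)
        for a, b in combinations(sorted(candidates), 2)
        if (a, b) in link_set and (b, a) in link_set
    ]


# Hypothetical link data for illustration only.
links = [
    ("Guido van Rossum", "Python (Programming Language)"),
    ("Python (Programming Language)", "Guido van Rossum"),
    ("Python (genus)", "Snake"),
]
candidates = {"Guido van Rossum", "Python (Programming Language)", "Python (genus)"}
pairs = mutual_pairs(candidates, links)
```

Here only the programming language page is mutually linked with "Guido van Rossum", so it would be preferred over the other "Python" candidate.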

MichaelAquilina commented 10 years ago

If there is a way of collectively storing named entities or phrases, this could also improve accuracy:

Examples:

This could help performance considerably, especially with unique terms such as "Rossum".

MichaelAquilina commented 10 years ago

Storing links in the "PageLinks" table in the MySQL database.
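For reference, a sketch of what such a table and the count queries could look like. It uses an in-memory SQLite database as a stand-in for MySQL so it can be run directly, and the column names (`source_id`, `target_id`) and the `link_counts` helper are assumptions rather than the real schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Pages (
        id INTEGER PRIMARY KEY,
        title TEXT UNIQUE NOT NULL
    );
    -- One row per directed link; the composite key prevents duplicates.
    CREATE TABLE PageLinks (
        source_id INTEGER NOT NULL REFERENCES Pages(id),
        target_id INTEGER NOT NULL REFERENCES Pages(id),
        PRIMARY KEY (source_id, target_id)
    );
    """
)

# Hypothetical sample rows.
conn.executemany(
    "INSERT INTO Pages (id, title) VALUES (?, ?)",
    [(1, "Guido van Rossum"), (2, "Python (Programming Language)"), (3, "Monty Python")],
)
conn.executemany(
    "INSERT INTO PageLinks (source_id, target_id) VALUES (?, ?)",
    [(1, 2), (2, 1), (2, 3)],
)


def link_counts(conn, page_id):
    """Return (incoming, outgoing) link counts for one page."""
    incoming = conn.execute(
        "SELECT COUNT(*) FROM PageLinks WHERE target_id = ?", (page_id,)
    ).fetchone()[0]
    outgoing = conn.execute(
        "SELECT COUNT(*) FROM PageLinks WHERE source_id = ?", (page_id,)
    ).fetchone()[0]
    return incoming, outgoing
```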