Closed MichaelAquilina closed 10 years ago
When a link is found, check whether the linked page exists. If it doesn't, create a page entry but don't actually parse it (you can't anyway). At later stages, always check whether a Page already exists in the Pages table before parsing it.
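A minimal sketch of that get-or-create flow, using an in-memory dict in place of the Pages table. The `Page` and `PageStore` names and the `parsed` flag are assumptions made for illustration, not the project's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    title: str
    parsed: bool = False  # False marks a placeholder created from a link
    links: set = field(default_factory=set)


class PageStore:
    """Hypothetical in-memory stand-in for the Pages table."""

    def __init__(self):
        self._pages = {}

    def get_or_create(self, title):
        # Reuse the existing entry if there is one, else add a placeholder.
        page = self._pages.get(title)
        if page is None:
            page = Page(title)
            self._pages[title] = page
        return page

    def parse(self, title, linked_titles):
        # The page may already exist as a placeholder from an earlier link.
        page = self.get_or_create(title)
        for target in linked_titles:
            self.get_or_create(target)  # entry only, not parsed
            page.links.add(target)
        page.parsed = True
        return page


store = PageStore()
store.parse("Guido van Rossum", ["Python (Programming Language)"])
placeholder = store.get_or_create("Python (Programming Language)")
```

Here `placeholder.parsed` stays `False` until `parse` is later called for that title, at which point the same entry is filled in rather than duplicated.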
You could also treat links between pages as strong indicators that the pages are related, and therefore that the detected concepts are probably correct.
For example:
Guido van Rossum is the founder of Python.
The "Python (Programming Language)" and "Guido van Rossum" pages link to one another on Wikipedia, which is a strong indicator that the sentence is talking about these two concepts.
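The mutual-link check could be sketched roughly as below. The `links` mapping (page title to the set of titles it links to) and the `mutually_linked` helper are assumptions for illustration. The sample data only mirrors the example above.

```python
def mutually_linked(links, a, b):
    """Return True if page a links to page b and page b links back to a."""
    return b in links.get(a, set()) and a in links.get(b, set())


# Hypothetical sample of outgoing links per page title.
links = {
    "Guido van Rossum": {"Python (Programming Language)"},
    "Python (Programming Language)": {"Guido van Rossum"},
}

related = mutually_linked(
    links, "Guido van Rossum", "Python (Programming Language)"
)
```

A one-way link is weaker evidence, so this helper deliberately requires the link in both directions.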
If there is a way of collectively storing named entities or phrases, this could also improve accuracy:
Examples:
This could substantially improve performance, especially with distinctive terms such as "Rossum".
Storing links in a "PageLinks" table in the MySQL database.
Counting the number of incoming and outgoing links of Wikipedia pages could improve ranking performance, since tf-idf as it stands is proving to have some problems when information retrieval techniques are performed.
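As a rough sketch of the counting step, assuming the link rows can be read out as `(source, target)` title pairs. That row shape and the `link_counts` name are assumptions, not the actual "PageLinks" schema.

```python
from collections import Counter


def link_counts(page_links):
    """Count incoming and outgoing links per page from (source, target) pairs."""
    incoming = Counter()
    outgoing = Counter()
    # set() drops duplicate rows so a repeated link is counted once.
    for source, target in set(page_links):
        outgoing[source] += 1
        incoming[target] += 1
    return incoming, outgoing


# Hypothetical rows for illustration only.
rows = [
    ("Guido van Rossum", "Python (Programming Language)"),
    ("Python (Programming Language)", "Guido van Rossum"),
    ("Monty Python", "Python (Programming Language)"),
]
incoming, outgoing = link_counts(rows)
```

How these counts should be combined with the tf-idf score is left open here. One option would be to use the incoming count as a tie-breaker between candidate pages.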