UTMediaCAT / Voyage


More False Positives #6

Closed ldfelipe closed 8 years ago

ldfelipe commented 9 years ago

The crawler has been picking up Huffington Post pages that are not news articles but feeds that link to articles about a certain topic or person (e.g., George Bush, Hillary Clinton). The crawler picks them up only because the description of a contributing author (H. A. Goodman) appears on these pages and identifies him as a contributor to the Jerusalem Post (a matched keyword). Is there a way to fix this? Also, let me know if I can go ahead and delete these manually again.

These are some of the links where it shows up: http://huffingtonpost.com/news/george-w-bush http://huffingtonpost.com/news/hillary-clinton

yuya-iwabuchi commented 9 years ago

It is very difficult for the parser to figure out that these pages are NOT articles, since their structure meets the broad criteria the parser uses to identify and parse articles. Deleting them would therefore not help: they would simply come back on the next crawl cycle.

As a suggestion, we could add another page (or a per-referring-site setting) where we specify the URLs that should be avoided when crawling. For example, if you add 'http://huffingtonpost.com/news/george-w-bush' to the list, this exact page will never get stored in the database. Furthermore, we could support regex to cover general cases. For example, if you add 'http://huffingtonpost.com/news/.*' to the list, every page whose URL starts with 'http://huffingtonpost.com/news/' will never be added from then on. This would exclude all the links mentioned in this issue.
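A minimal sketch of how such an exclusion list could work, not the actual Voyage implementation. The names `EXCLUDED_URL_PATTERNS`, `compile_patterns`, and `is_excluded` are hypothetical, and the sketch assumes every list entry is treated as a regular expression that must match the whole URL, so a plain URL excludes only that exact page while a trailing `.*` excludes everything under a prefix:

```python
import re

# Hypothetical exclusion list; each entry is a regular expression.
# Dots are escaped so they match only a literal "." in the URL.
EXCLUDED_URL_PATTERNS = [
    r"http://huffingtonpost\.com/news/.*",
]


def compile_patterns(patterns):
    """Compile the raw pattern strings once, before crawling starts."""
    return [re.compile(pattern) for pattern in patterns]


def is_excluded(url, compiled_patterns):
    """Return True if the whole URL matches any exclusion pattern."""
    return any(pattern.fullmatch(url) for pattern in compiled_patterns)


compiled = compile_patterns(EXCLUDED_URL_PATTERNS)

# The crawler would call this check before storing a page.
for candidate in [
    "http://huffingtonpost.com/news/george-w-bush",
    "http://huffingtonpost.com/some-article",
]:
    action = "skip" if is_excluded(candidate, compiled) else "store"
    print(action, candidate)
```

Using `fullmatch` rather than `search` keeps an exact entry from accidentally excluding longer URLs that merely contain it. Note that an unescaped `.` in an entry still matches any single character, which is usually harmless for URLs but slightly broader than a literal comparison.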

kimpham54 commented 8 years ago

Moved to Trello.