bcampbell / churnalism-extensions

Extension sometimes compares articles against themselves #4

Open bcampbell opened 10 years ago

bcampbell commented 10 years ago

Steps to reproduce: go to http://www.theguardian.com/uk-news/2014/jan/14/phone-hacking-trial-cctv-charlie-brooks-laptop?CMP=twt_gu

Observed: the churn warning pops up, and the same article is listed as a match.

Expected: Articles should never have themselves listed as possible churn matches.

bcampbell commented 10 years ago

The extension is being confused by the slightly different URL. It should look for canonical URLs in the page (e.g. a rel="canonical" link) and include those in the check, so that the article itself is filtered out of the list of matches.
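
A rough sketch of that check, in Python for brevity rather than the extension's own code. The function names and the shape of a match (a dict with a "url" key) are assumptions made for illustration, not the extension's actual API. Dropping the query string and fragment is also an assumed heuristic, and it would be wrong for sites that identify articles by a query parameter.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit, urlunsplit


class CanonicalLinkParser(HTMLParser):
    """Records the href of the first <link rel="canonical"> seen in a page."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attr = dict(attrs)
        # rel is a space-separated list of link types; compare case-insensitively.
        rels = (attr.get("rel") or "").lower().split()
        if "canonical" in rels and attr.get("href"):
            self.canonical = attr["href"]


def normalise(url):
    """Assumed heuristic: lowercase scheme and host, drop query string,
    fragment and trailing slash."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, "", ""))


def own_urls(page_url, html):
    """All URLs the current page is known by: the address bar URL plus
    the rel=canonical URL, if the page declares one."""
    parser = CanonicalLinkParser()
    parser.feed(html)
    urls = {normalise(page_url)}
    if parser.canonical:
        # The canonical href may be relative, so resolve it against the page URL.
        urls.add(normalise(urljoin(page_url, parser.canonical)))
    return urls


def filter_self_matches(page_url, html, matches):
    """Drop any match whose URL is one of the current page's own URLs."""
    mine = own_urls(page_url, html)
    return [m for m in matches if normalise(m["url"]) not in mine]
```

With something like this, a page loaded with a tracking parameter appended and a match stored under the plain URL would both normalise to the same string, so the match would be dropped.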

bcampbell commented 10 years ago

Similar case: http://www.independent.co.uk/student/news/david-caron-i-want-kings-college-london-to-be-the-harvey-nichols-of-law-schools-8577413.html

This one is trickier: the churnalism database has the article under a non-canonical URL (in the education section of the paper), which the extension has no way of discovering. The server needs to keep track of multiple URLs for each article. It might already do so, I can't remember; if it does, the extension isn't picking them up.
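
If the server did return every URL it knows for a matched article, the extension-side check could look something like the sketch below (Python again, for brevity). The "alternate_urls" field is hypothetical; nothing here reflects what the server actually sends. It also assumes all URLs have already been normalised the same way.

```python
def is_same_article(page_urls, match):
    """True if any URL the server knows for the match is also one of the
    current page's URLs. 'alternate_urls' is a hypothetical response field."""
    known = {match["url"], *match.get("alternate_urls", [])}
    return not known.isdisjoint(page_urls)


def drop_self_matches(page_urls, matches):
    """Keep only matches that share no URL with the current page."""
    return [m for m in matches if not is_same_article(page_urls, m)]
```

Comparing two sets of URLs rather than two single URLs means it doesn't matter which side holds the canonical one, as long as they overlap somewhere.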