Duplicates

ldfelipe commented 9 years ago

The crawler lists articles from the referring sites electronicintifada.net and foxnews.com twice, each time with the same URL.

The crawler also picked up the same article under two different URLs:

http://feeds.huffingtonpost.com/c/35496/f/677045/s/47940b3b/sc/7/l/0L0Shuffingtonpost0N0C20A150C0A60C250Cjewish0Eshabbat0Eof0Esolidary0Eafrican0Eamericans0In0I76571980Bhtml/story01.htm
http://huffingtonpost.com/2015/06/25/jewish-shabbat-of-solidary-african-americans_n_7657198.html

yuya-iwabuchi commented 9 years ago

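Using urllib2.urlopen(url).geturl() will resolve this issue. It follows redirects and returns the final URL, so both of the URLs above resolve to the same address: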

>>> import urllib2
>>> urllib2.urlopen('http://feeds.huffingtonpost.com/c/35496/f/677045/s/47940b3b/sc/7/l/0L0Shuffingtonpost0N0C20A150C0A60C250Cjewish0Eshabbat0Eof0Esolidary0Eafrican0Eamericans0In0I76571980Bhtml/story01.htm').geturl()
'http://www.huffingtonpost.com/2015/06/25/jewish-shabbat-of-solidary-african-americans_n_7657198.html'
>>> urllib2.urlopen('http://huffingtonpost.com/2015/06/25/jewish-shabbat-of-solidary-african-americans_n_7657198.html').geturl()
'http://www.huffingtonpost.com/2015/06/25/jewish-shabbat-of-solidary-african-americans_n_7657198.html'
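
For illustration, here is a rough sketch of how the crawler could use this to drop duplicates. It is not the fix that was actually implemented. It assumes Python 3, where urllib.request.urlopen replaces urllib2.urlopen, and the names resolve_url and dedupe_by_resolved_url are hypothetical helpers, not part of the project:

```python
from urllib.request import urlopen


def resolve_url(url, timeout=10):
    """Follow redirects and return the final URL (same idea as urllib2's geturl())."""
    with urlopen(url, timeout=timeout) as response:
        return response.geturl()


def dedupe_by_resolved_url(urls, resolve=resolve_url):
    """Return one resolved URL per distinct article, in first-seen order.

    `resolve` is injectable so the logic can be exercised without network access.
    """
    seen = set()
    unique = []
    for url in urls:
        try:
            final_url = resolve(url)
        except OSError:
            # Assumption: if the request fails, fall back to the URL as given
            # instead of dropping the article.
            final_url = url
        if final_url not in seen:
            seen.add(final_url)
            unique.append(final_url)
    return unique
```

Because every URL is compared after resolution, this would cover both cases reported above: the same URL listed twice, and a feed URL that redirects to the same article. Resolving costs one HTTP request per URL, so a real crawler would probably want to cache the results.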

zhouwein commented 9 years ago

Fix implemented; pending testing before it is merged.

UTMediaCAT / Voyage

Duplicates #7