Closed — ldfelipe closed this issue 9 years ago
Using urllib2.urlopen(url).geturl() will resolve this issue:
>>> import urllib2
>>> urllib2.urlopen('http://feeds.huffingtonpost.com/c/35496/f/677045/s/47940b3b/sc/7/l/0L0Shuffingtonpost0N0C20A150C0A60C250Cjewish0Eshabbat0Eof0Esolidary0Eafrican0Eamericans0In0I76571980Bhtml/story01.htm').geturl()
'http://www.huffingtonpost.com/2015/06/25/jewish-shabbat-of-solidary-african-americans_n_7657198.html'
>>> urllib2.urlopen('http://huffingtonpost.com/2015/06/25/jewish-shabbat-of-solidary-african-americans_n_7657198.html').geturl()
'http://www.huffingtonpost.com/2015/06/25/jewish-shabbat-of-solidary-african-americans_n_7657198.html'
>>>
Fix implemented; pending testing before being merged.
The crawler is listing articles from the referring sites electronicintifada.net and foxnews.com twice, each time with the same URL.
The crawler also picked up the exact same article under two different URLs:
http://feeds.huffingtonpost.com/c/35496/f/677045/s/47940b3b/sc/7/l/0L0Shuffingtonpost0N0C20A150C0A60C250Cjewish0Eshabbat0Eof0Esolidary0Eafrican0Eamericans0In0I76571980Bhtml/story01.htm
http://huffingtonpost.com/2015/06/25/jewish-shabbat-of-solidary-african-americans_n_7657198.html
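A minimal sketch of the proposed fix, assuming the crawler can deduplicate by resolved URL. This uses Python 3's urllib.request (the successor to urllib2 shown above; urlopen().geturl() behaves the same way and reports the URL after redirects). The function and parameter names here are hypothetical, not the crawler's actual API; the resolver is injectable so the logic can be exercised without network access:

```python
from urllib.request import urlopen
from typing import Callable, List

def resolve_url(url: str) -> str:
    """Follow HTTP redirects and return the final (canonical) URL.

    urlopen follows redirects by default; geturl() reports where
    the response actually came from, so a feed-proxy URL like the
    feeds.huffingtonpost.com one above resolves to the article URL.
    """
    with urlopen(url) as response:
        return response.geturl()

def dedupe_articles(urls: List[str],
                    resolver: Callable[[str], str] = resolve_url) -> List[str]:
    """Keep only the first article seen for each canonical URL.

    Two different incoming URLs that redirect to the same place
    (the duplicate-article case reported in this issue) collapse
    into a single entry.
    """
    seen = set()
    unique = []
    for url in urls:
        canonical = resolver(url)
        if canonical not in seen:
            seen.add(canonical)
            unique.append(canonical)
    return unique
```

With this in place, both Huffington Post URLs from the report would resolve to the same www.huffingtonpost.com address and be stored once.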