codelucas / newspaper

newspaper3k is a news, full-text, and article metadata extraction in Python 3. Advanced docs:
https://goo.gl/VX41yK
MIT License

Doesn't work with Arabic news sites #23

Closed mahmoudhossam closed 10 years ago

mahmoudhossam commented 10 years ago

I tried newspaper 0.0.6 with a bunch of Arabic websites and it didn't seem to fetch any articles.

In [38]: newspaper.version.version_info
Out[38]: (0, 0, 6)

In [39]: alarabiya = newspaper.build('http://www.alarabiya.net/', language='ar')

In [40]: tahrirnews = newspaper.build('http://tahrirnews.com/', language='ar')

In [41]: ahram = newspaper.build('http://www.ahram.org.eg/', language='ar')

In [42]: almasryalyoum = newspaper.build('http://www.almasryalyoum.com/', language='ar')

In [43]: for src in (alarabiya, tahrirnews, ahram, almasryalyoum):
   ....:     print(src.size())
   ....:     
0
0
0
0

codelucas commented 10 years ago

I'll look into this, sorry for that.

codelucas commented 10 years ago

Booyah, found the fix! Well, it works for the last 3 domains.

For tahrirnews, ahram, and almasryalyoum, the issue was that they serve .aspx pages, and my code wrongly rejected URLs that ended in .aspx or .jsp. That has been fixed in newspaper/urls.py. The change will be active in the next release of newspaper, so you may have to wait a few days (or install the dev version from GitHub now!).
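To make the .aspx/.jsp problem concrete, here is a minimal sketch of an extension-based URL filter. It is not the actual code in newspaper/urls.py: the function name `looks_like_article_url`, the `ALLOWED_EXTENSIONS` set, and the example URL path are all hypothetical, chosen only to show how leaving server-side page extensions off an allow-list can silently drop every link on a site.

```python
import os
from urllib.parse import urlparse

# Hypothetical allow-list; the real rules in newspaper/urls.py may differ.
ALLOWED_EXTENSIONS = {"", ".html", ".htm", ".php", ".aspx", ".jsp"}


def looks_like_article_url(url, allowed=ALLOWED_EXTENSIONS):
    """Return True if the URL path has no extension or an allowed one."""
    path = urlparse(url).path
    ext = os.path.splitext(path)[1].lower()
    return ext in allowed


# Simulate the old behaviour by removing the two extensions from the set.
old_rules = ALLOWED_EXTENSIONS - {".aspx", ".jsp"}
url = "http://www.ahram.org.eg/News/1.aspx"  # made-up path for illustration

print(looks_like_article_url(url, old_rules))  # rejected under the old rules
print(looks_like_article_url(url))             # accepted once .aspx is allowed
```

If a site yields zero articles, checking a handful of its article URLs against a filter like this is a quick way to see whether the links are being discarded before any download happens.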

When you are debugging this yourself, feel free to try out the verbose option:

>>> tahrir=newspaper.build('http://tahrirnews.com/', verbose=True, memoize_articles=False)
120->58->58 for http://tahrirnews.com/culture
180->70->70 for http://tahrirnews.com/
49->5->5 for http://tahrirnews.com/pdf
3->0->0 for http://tahrirnews.com/rss
180->70->70 for http://tahrirnews.com
118->58->58 for http://tahrirnews.com/egypt
49->5->5 for http://tahrirnews.com/prayertimes
118->58->58 for http://tahrirnews.com/economy
180->70->70 for http://www.tahrirnews.tv
118->58->58 for http://tahrirnews.com/sports
118->58->58 for http://tahrirnews.com/sports
113->37->37 for http://tahrirnews.com/world-regions
118->58->58 for http://tahrirnews.com/accidents
113->37->37 for http://tahrirnews.com/local
50->5->5 for http://tahrirnews.com/ads
150->5->5 for http://tahrirnews.com/columns
120->58->58 for http://tahrirnews.com/politics
48->5->5 for http://tahrirnews.com/currency
62->5->5 for http://tahrirnews.com/comics
118->58->58 for http://tahrirnews.com/variety

The 118->58->58 above indicates that we see 118 links on that category page, we filter that down to 58 through our urls.py logic, and then we filter again down to 58 through memoization (which we have turned off, so 58 does not change!).
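If you want to scan a long verbose log for categories where the URL filter is throwing away most links, a small parser helps. This is only a sketch that assumes the line format shown above stays as `seen->filtered->fresh for url`; `parse_verbose_line` and `dropped_by_url_filter` are hypothetical helper names, not part of newspaper.

```python
import re

# Matches lines such as "118->58->58 for http://tahrirnews.com/egypt".
LINE_RE = re.compile(r"^(\d+)->(\d+)->(\d+) for (\S+)$")


def parse_verbose_line(line):
    """Split one verbose line into (seen, after_url_filter, after_memoize, url)."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    seen, filtered, fresh = (int(g) for g in match.groups()[:3])
    return seen, filtered, fresh, match.group(4)


def dropped_by_url_filter(line):
    """Fraction of links on a category page removed by the URL filter."""
    parsed = parse_verbose_line(line)
    if parsed is None or parsed[0] == 0:
        return None
    seen, filtered, _, _ = parsed
    return (seen - filtered) / seen


print(parse_verbose_line("118->58->58 for http://tahrirnews.com/egypt"))
print(dropped_by_url_filter("3->0->0 for http://tahrirnews.com/rss"))
```

A category where the second number collapses to 0 while the first is large is the pattern to look for: links were found, but the filter rejected all of them.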

For Alarabiya it still won't work, because the site requires JavaScript to be enabled in addition to cookies (cookies we do support).

Check out how I discovered this:

>>> import newspaper
>>> ala = newspaper.build('http://www.alarabiya.net/')
# the above command finishes in 0.5 seconds. Whenever a build() command
# runs that quickly, something is very wrong

>>> ala.html
'<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html><head><meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<meta http-equiv="Content-Script-Type" content="text/javascript">
<script type="text/javascript">
function setCookie(c_name, value, expiredays) { var exdate = new Date(); exdate.setDate(exdate.getDate()+expiredays); document.cookie = c_name + "=" + escape(value) + ((expiredays==null) ? "" : ";expires=" + exdate.toGMTString()) + ";path=/"; }
function getHostUri() { var loc = document.location; return loc.toString(); }
setCookie(\'YPF8827340282Jdskjhfiw_928937459182JAX666\', \'174.77.53.97\', 10);
setCookie(\'DOAReferrer\', document.referrer, 10);
location.href = getHostUri();</script></head><body>
<noscript>This site requires JavaScript and Cookies to be enabled. Please change your browser settings or upgrade your browser.</noscript></body></html>\n'
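A response like this can also be flagged automatically instead of by eye. The sketch below is a rough heuristic, not something newspaper provides: `looks_like_js_wall` is a hypothetical helper, and the 2000-character cutoff is an arbitrary assumption. It treats a very short page that contains both a `<noscript>` fallback and a script-driven `location.href` redirect as a JavaScript/cookie wall.

```python
import re


def looks_like_js_wall(html, max_length=2000):
    """Heuristic: a short page with a <noscript> warning and a script redirect."""
    if html is None or len(html) > max_length:
        return False
    lowered = html.lower()
    has_noscript = "<noscript" in lowered
    has_redirect = re.search(r"location\.href\s*=", lowered) is not None
    return has_noscript and has_redirect


# Abbreviated version of the response shown above.
sample = (
    '<html><head><script type="text/javascript">'
    "location.href = getHostUri();</script></head><body>"
    "<noscript>This site requires JavaScript and Cookies to be enabled."
    "</noscript></body></html>"
)
print(looks_like_js_wall(sample))
```

Running a check like this on the source's `html` attribute right after `build()` would explain a zero article count without having to read the markup manually.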

Note the <noscript> message in the HTML above: "This site requires JavaScript and Cookies to be enabled."

Thanks for notifying me of this issue. Let me know if it's ok to close it!