codelucas / newspaper

newspaper3k is a news, full-text, and article metadata extraction in Python 3. Advanced docs:
https://goo.gl/VX41yK
MIT License

Why can't I extract the article body from the following link? #752

Open khalednatour opened 4 years ago

khalednatour commented 4 years ago

Why can't I extract the article body from the following link?

https://www.aa.com.tr/ar/%D8%AA%D8%B1%D9%83%D9%8A%D8%A7/%D8%A3%D8%B1%D8%AF%D9%88%D8%BA%D8%A7%D9%86-%D8%A8%D9%81%D8%B6%D9%84-%D9%86%D8%B6%D8%A7%D9%84%D9%86%D8%A7-%D8%A3%D9%86%D9%82%D8%B0%D9%86%D8%A7-%D8%A8%D9%84%D8%AF%D9%86%D8%A7-%D9%85%D9%86-%D8%A7%D9%84%D9%85%D9%83%D8%A7%D8%A6%D8%AF-%D8%A7%D9%84%D8%AE%D8%A8%D9%8A%D8%AB%D8%A9/1653238

iwpnd commented 4 years ago

Two problems:

from newspaper import Article

url = "https://www.aa.com.tr/ar/%D8%AA%D8%B1%D9%83%D9%8A%D8%A7/%D8%A3%D8%B1%D8%AF%D9%88%D8%BA%D8%A7%D9%86-%D8%A8%D9%81%D8%B6%D9%84-%D9%86%D8%B6%D8%A7%D9%84%D9%86%D8%A7-%D8%A3%D9%86%D9%82%D8%B0%D9%86%D8%A7-%D8%A8%D9%84%D8%AF%D9%86%D8%A7-%D9%85%D9%86-%D8%A7%D9%84%D9%85%D9%83%D8%A7%D8%A6%D8%AF-%D8%A7%D9%84%D8%AE%D8%A8%D9%8A%D8%AB%D8%A9/1653238"

article = Article(url=url, language='ar')
article.download()
article.parse()
print(article.text)
>> ''

And for an English article from the same source:

url = "https://www.aa.com.tr/en/sports/gozbasi-elected-interim-chairwoman-in-turkish-super-lig/1653219"
article = Article(url=url, language='en')
article.download()
article.parse()
print(article.text)
>> 'Your opinions matter to us times;\n\nFeedback 0 / 5'

This actually happens quite often with newspaper. Newspaper downloads the HTML for your URL, parses it, and then tries to guess which part of the page is the article's content using stopwords. You could fork newspaper, tweak the Arabic stopwords, and see if that helps you.
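To illustrate the stopword idea, here is a minimal sketch. It is not newspaper's actual code: the function names and the tiny stopword set are made up for this example, and the third text block is an invented sample sentence (the first two come from the output above). The assumption is simply that real prose contains many common words, while widget text contains few.

```python
import re

# Tiny illustrative stopword set (an assumption for this sketch);
# a real extractor would use a much larger list per language.
STOPWORDS_EN = {
    "the", "a", "an", "and", "of", "to", "in", "is",
    "was", "for", "on", "that", "with", "as",
}


def stopword_count(text, stopwords=STOPWORDS_EN):
    """Count how many words in `text` are stopwords."""
    words = re.findall(r"\w+", text.lower())
    return sum(1 for word in words if word in stopwords)


def pick_body(blocks, stopwords=STOPWORDS_EN):
    """Return the candidate block with the most stopwords ('' if none)."""
    return max(blocks, key=lambda b: stopword_count(b, stopwords), default="")


blocks = [
    "Feedback 0 / 5",
    "Your opinions matter to us",
    # Invented sample sentence standing in for a real article paragraph.
    "The chairwoman was elected on Monday and is the first woman "
    "to lead the league in an interim role.",
]
best = pick_body(blocks)
print(best)
```

If the stopword list does not fit the page's language, or a widget happens to outscore the real body, a heuristic like this picks the wrong block or nothing at all, which would be consistent with the two results above.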

I found that cookie banners, comment sections, and advertisement prompts throw it off a lot. It works more often than not, but I could not find a way to tune it for edge cases like yours.
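One possible workaround, which I have not verified against this site, is to strip the noisy elements from the HTML before extraction. The sketch below uses only the standard library. `NOISE_HINTS`, `NoiseStripper`, and `strip_noise` are hypothetical names, and the hint substrings are guesses; the actual class and id names on aa.com.tr would need to be checked by hand. It assumes reasonably well-nested markup and drops comments and the doctype.

```python
from html.parser import HTMLParser

# Substrings that mark an element as noise when found in its class or id.
# These are guesses for illustration, not taken from any real site.
NOISE_HINTS = ("cookie", "comment", "feedback", "advert")

# Void elements never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class NoiseStripper(HTMLParser):
    """Re-emit HTML while dropping elements whose class/id looks like noise."""

    def __init__(self):
        # Keep entities unconverted so they can be passed through verbatim.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a dropped element

    def _is_noise(self, attrs):
        return any(
            name in ("class", "id") and value
            and any(hint in value.lower() for hint in NOISE_HINTS)
            for name, value in attrs
        )

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            if not self.skip_depth:
                self.out.append(self.get_starttag_text())
        elif self.skip_depth:
            self.skip_depth += 1
        elif self._is_noise(attrs):
            self.skip_depth = 1
        else:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags are kept unless they sit inside or are noise.
        if not self.skip_depth and not self._is_noise(attrs):
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1
        else:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.skip_depth:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if not self.skip_depth:
            self.out.append(f"&#{name};")


def strip_noise(html):
    """Return `html` with noise-looking elements removed."""
    parser = NoiseStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


sample = (
    '<div id="cookie-banner"><p>Accept <b>cookies</b></p></div>'
    '<article><p>Real text &amp; more.<br></p></article>'
    '<div class="Comments-section">x</div>'
)
cleaned = strip_noise(sample)
print(cleaned)
```

As far as I know, `Article.download()` in newspaper3k accepts an `input_html` argument, so the cleaned markup could be passed in that way instead of letting newspaper fetch the page itself. Whether that is enough to recover the body for this particular article is untested.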

This does not solve your problem, but it at least explains it to some extent.