khalednatour opened this issue 4 years ago (status: Open)
Two problems:
from newspaper import Article
url = "https://www.aa.com.tr/ar/%D8%AA%D8%B1%D9%83%D9%8A%D8%A7/%D8%A3%D8%B1%D8%AF%D9%88%D8%BA%D8%A7%D9%86-%D8%A8%D9%81%D8%B6%D9%84-%D9%86%D8%B6%D8%A7%D9%84%D9%86%D8%A7-%D8%A3%D9%86%D9%82%D8%B0%D9%86%D8%A7-%D8%A8%D9%84%D8%AF%D9%86%D8%A7-%D9%85%D9%86-%D8%A7%D9%84%D9%85%D9%83%D8%A7%D8%A6%D8%AF-%D8%A7%D9%84%D8%AE%D8%A8%D9%8A%D8%AB%D8%A9/1653238"
article = Article(url=url, language='ar')
article.download()
article.parse()
print(article.text)
>> ''
And for an English article from the same source:
url = "https://www.aa.com.tr/en/sports/gozbasi-elected-interim-chairwoman-in-turkish-super-lig/1653219"
article = Article(url=url, language='en')
article.download()
article.parse()
print(article.text)
>> 'Your opinions matter to us times;\n\nFeedback 0 / 5'
This actually happens quite often with newspaper. Newspaper downloads the HTML for your URL, parses it, and tries to guess the article's content using stopwords. You could fork newspaper, tweak the Arabic stopwords, and see if that helps.
I found that cookie banners, comment sections, and advertisement prompts throw it off a lot. It works more often than not, but I could not find a way to tune it for edge cases like yours.
This does not solve your problem, but it at least explains it to some extent.
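To make the stopword idea concrete, here is a minimal sketch of that kind of heuristic. It is not newspaper's actual code: the names `ParagraphCollector`, `stopword_score`, and `guess_body`, the tiny stopword set, the `min_score` threshold, and the sample HTML are all made up for illustration. It only shows why a widget like "Feedback 0 / 5" scores low, and how a stopword list that does not match the page's text could plausibly leave you with an empty string.

```python
from html.parser import HTMLParser

# Tiny illustrative stopword set (an assumption); real extractors ship
# much larger per-language lists.
STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "was",
             "for", "on", "as", "by", "that", "with"}


class ParagraphCollector(HTMLParser):
    """Collects the text content of each <p> element."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buffer = None  # None while we are outside a <p>

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._buffer is not None:
            self.paragraphs.append("".join(self._buffer).strip())
            self._buffer = None


def stopword_score(text, stopwords=STOPWORDS):
    """Count how many words in the text are stopwords."""
    words = (w.strip(".,;:!?\"'") for w in text.lower().split())
    return sum(1 for w in words if w in stopwords)


def guess_body(html, stopwords=STOPWORDS, min_score=2):
    """Keep only paragraphs that contain enough stopwords to look like prose."""
    collector = ParagraphCollector()
    collector.feed(html)
    kept = [p for p in collector.paragraphs
            if stopword_score(p, stopwords) >= min_score]
    return "\n\n".join(kept)


# Made-up page: one feedback widget and two prose paragraphs.
sample_html = """
<div class="feedback"><p>Feedback 0 / 5</p></div>
<article>
<p>The club elected an interim chairwoman on Monday, and the vote was held in the capital.</p>
<p>She is expected to serve for the rest of the season.</p>
</article>
"""

# Prose paragraphs survive; the widget text is dropped.
print(guess_body(sample_html))

# With a stopword list that matches nothing on the page, every paragraph is rejected.
print(repr(guess_body(sample_html, stopwords=set())))
```

The second call is the interesting one for the Arabic case: if the stopword list matches little of the page's text, no block passes the threshold. Whether that is what actually happens for this URL is an assumption; inspecting `article.html` after `download()` would show what newspaper had to work with.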
Why can't I extract the article body from the following link?
https://www.aa.com.tr/ar/%D8%AA%D8%B1%D9%83%D9%8A%D8%A7/%D8%A3%D8%B1%D8%AF%D9%88%D8%BA%D8%A7%D9%86-%D8%A8%D9%81%D8%B6%D9%84-%D9%86%D8%B6%D8%A7%D9%84%D9%86%D8%A7-%D8%A3%D9%86%D9%82%D8%B0%D9%86%D8%A7-%D8%A8%D9%84%D8%AF%D9%86%D8%A7-%D9%85%D9%86-%D8%A7%D9%84%D9%85%D9%83%D8%A7%D8%A6%D8%AF-%D8%A7%D9%84%D8%AE%D8%A8%D9%8A%D8%AB%D8%A9/1653238