codelucas / newspaper

newspaper3k is a news, full-text, and article metadata extraction library for Python 3. Advanced docs:
https://goo.gl/VX41yK
MIT License
13.89k stars 2.1k forks

Article.text doesn't provide full article for some URLs #399

Open pratik151192 opened 6 years ago

pratik151192 commented 6 years ago

When I fetch article content through Article.text, sometimes only the first few paragraphs are returned, with "Read More" appended at the end. In some cases, even "Read More" doesn't appear.

URL to reproduce error: http://www.cnn.com/2017/01/30/politics/trump-immigration-ban-refugees-trnd/index.html

The demo website link: http://newspaper-demo.herokuapp.com/articles/show?url_to_clean=http%3A%2F%2Fwww.cnn.com%2F2017%2F01%2F30%2Fpolitics%2Ftrump-immigration-ban-refugees-trnd%2Findex.html

The demo website does fetch the full content, but querying the same URL through my code doesn't.

Cabu commented 6 years ago

I have a similar problem with the NYTimes, where the beginning of the article is not loaded. The article is split across 2 DIVs and the system picks the second (bigger) one...

The article: https://www.nytimes.com/2017/09/04/world/asia/muslims-rohingya-daw-aung-san-suu-kyi-malala-myanmar.html
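To illustrate the behaviour described in this comment, here is a toy sketch. It is not newspaper's actual extraction code; `pick_largest`, `join_all`, and the sample `blocks` are hypothetical names and data, made up for illustration only. It shows how keeping only the largest text block drops an opening paragraph that lives in a separate, smaller block.

```python
def pick_largest(blocks):
    """Keep only the single longest text block (the behaviour described above)."""
    return max(blocks, key=len)


def join_all(blocks):
    """Alternative: keep every non-empty block, in document order."""
    return "\n\n".join(b for b in blocks if b.strip())


# Hypothetical content standing in for the two DIVs of an article.
blocks = [
    "Short opening paragraph.",
    "A much longer second block that holds the rest of the article body.",
]

largest_only = pick_largest(blocks)  # the opening paragraph is lost
combined = join_all(blocks)          # the opening paragraph is kept
```

Whether joining sibling blocks is safe on a real page depends on the markup, so this is only a way to picture the failure, not a proposed fix.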

ckcollab commented 1 year ago

This still seems like a problem:

>>> from newspaper import Article
>>> url = "https://www.cnn.com/2022/08/03/media/alex-jones-sandy-hook-trial/index.html"
>>> a = Article(url)
>>> a.download()
>>> a.parse()
>>> a.text
"New York (CNN Business) <a paragraph or two>...\n\nRead More"
>>>
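A minimal sketch for flagging this symptom in your own code, assuming the truncated text ends with the literal string "Read More" as in the output above. `looks_truncated` is a hypothetical helper, not part of newspaper. As noted earlier in the thread, some truncated articles have no "Read More" at all, so this catches only part of the problem.

```python
def looks_truncated(text, marker="Read More"):
    """Return True if the extracted text ends with the teaser marker.

    `marker` is an assumption based on the output shown in this thread.
    """
    return text.rstrip().endswith(marker)


# Sample strings for illustration only, not real extraction output.
truncated_sample = "New York (CNN Business) first paragraph\n\nRead More"
complete_sample = "New York (CNN Business) first paragraph\n\nLast paragraph."
```

With a parsed Article `a`, this would be called as `looks_truncated(a.text)`.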