AndyTheFactory / newspaper4k

📰 Newspaper4k, a fork of the beloved Newspaper3k. Extraction of articles, titles, and metadata from news websites.
MIT License

Article.text doesn't provide full article for some URLs #109

Open AndyTheFactory opened 11 months ago

AndyTheFactory commented 11 months ago

Issue by pratik151192 Wed Jul 12 21:00:38 2017 Originally opened as https://github.com/codelucas/newspaper/issues/399


When fetching the article content via Article.text, sometimes only the first few paragraphs are returned, with "Read More" appended at the end. In some cases, even "Read More" doesn't appear.

URL to reproduce error: http://www.cnn.com/2017/01/30/politics/trump-immigration-ban-refugees-trnd/index.html

The demo website link: http://newspaper-demo.herokuapp.com/articles/show?url_to_clean=http%3A%2F%2Fwww.cnn.com%2F2017%2F01%2F30%2Fpolitics%2Ftrump-immigration-ban-refugees-trnd%2Findex.html

The demo website does fetch the full content, but querying the same URL through my code doesn't.

AndyTheFactory commented 11 months ago

Comment by Cabu Tue Sep 5 09:28:43 2017


I have a similar problem with the NYTimes, where the beginning of the article is not loaded. The article is split across 2 DIVs and the system picks the second (bigger) one...

The article: https://www.nytimes.com/2017/09/04/world/asia/muslims-rohingya-daw-aung-san-suu-kyi-malala-myanmar.html
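To illustrate the effect described above, here is a minimal sketch, not newspaper's actual extraction logic. It assumes a hypothetical page whose body is split across two sibling DIVs and compares keeping only the largest DIV with joining all of them. `DivTextCollector`, `div_texts` and `SAMPLE_HTML` are made-up names for this example.

```python
from html.parser import HTMLParser


class DivTextCollector(HTMLParser):
    """Collect the text of each top-level <div> separately."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # current <div> nesting depth
        self.blocks = []  # one list of text fragments per top-level <div>

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            if self.depth == 0:
                self.blocks.append([])
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Only keep non-empty text that sits inside some <div>.
        if self.depth > 0 and data.strip():
            self.blocks[-1].append(data.strip())


def div_texts(html):
    """Return the text of every top-level <div> in document order."""
    parser = DivTextCollector()
    parser.feed(html)
    parser.close()
    return [" ".join(parts) for parts in parser.blocks]


# Hypothetical page: the article body is split across two sibling DIVs.
SAMPLE_HTML = (
    "<div><p>Opening paragraph.</p></div>"
    "<div><p>Second paragraph.</p><p>Third paragraph.</p></div>"
)

blocks = div_texts(SAMPLE_HTML)
largest_only = max(blocks, key=len)  # loses the opening paragraph
all_blocks = "\n\n".join(blocks)     # keeps the whole article
```

With this sample, `largest_only` contains only the second and third paragraphs, which matches the symptom of a missing article beginning.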

AndyTheFactory commented 11 months ago

Comment by ckcollab Thu Aug 4 04:16:06 2022


This still seems like a problem:

>>> from newspaper import Article
>>> url = "https://www.cnn.com/2022/08/03/media/alex-jones-sandy-hook-trial/index.html"
>>> a = Article(url)
>>> a.download()
>>> a.parse()
>>> a.text
"New York (CNN Business) <a paragraph or two>...\n\nRead More"
>>>
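
Until the extraction itself is fixed, calling code could at least flag results like the one above. This is a rough sketch that assumes the only signal is a trailing "Read More" line, as in the output shown. `looks_truncated` and `strip_read_more` are hypothetical helpers, not part of newspaper.

```python
READ_MORE_MARKER = "Read More"


def looks_truncated(text, marker=READ_MORE_MARKER):
    """Return True if the last non-empty line of `text` is the marker."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return bool(lines) and lines[-1].lower() == marker.lower()


def strip_read_more(text, marker=READ_MORE_MARKER):
    """Remove a trailing marker line, leaving other text unchanged."""
    if not looks_truncated(text, marker):
        return text
    stripped = text.rstrip()
    return stripped[: len(stripped) - len(marker)].rstrip()


# Made-up sample shaped like the output in the transcript above.
sample = "New York (CNN Business) First paragraph.\n\nRead More"

if looks_truncated(sample):
    print("Article text is probably truncated")
print(strip_read_more(sample))
```

This heuristic misses the cases reported earlier in the thread where the text is cut short without any "Read More" at the end.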