codelucas / newspaper

newspaper3k is a library for news, full-text, and article metadata extraction in Python 3. Advanced docs:
https://goo.gl/VX41yK
MIT License
14.06k stars 2.11k forks

Entire article not downloaded when there are multiple images #775

Open EHTaylor12 opened 4 years ago

EHTaylor12 commented 4 years ago

I am running into an issue in which Newspaper3k does not download an article in full if the article has an image (usually a chart or graph) embedded in the middle of the text.

In this example article (Sample Article), Newspaper3k stops at the beginning of the graph embedded in the middle of the article.

Has anyone else experienced this or found a solution? I don’t need the images, so if there were a way to disable image loading in the settings, that would be even better.

edoardobassett commented 4 years ago

Yeah, I seem to be having a similar problem as well. I am trying to extract the text from the example link they gave, http://fox13now.com/2013/12/30/new-year-new-laws-obamacare-pot-guns-and-drones/, but it's not working.

EHTaylor12 commented 4 years ago

I have a possible workaround: download and save the desired webpage using Chrono Download Manager (a Chrome extension), and then parse it with newspaper3k.

I am struggling with how to parse locally stored HTML files with newspaper3k, as the Read the Docs documentation is vague and incomplete on the topic. Does anyone have any detailed guidance they could offer, or could you point me towards a good resource on this subject?