AndyTheFactory / newspaper4k

📰 Newspaper4k a fork of the beloved Newspaper3k. Extraction of articles, titles, and metadata from news websites.
MIT License

Article.parse() not parsing entire article body correctly from HTML! #158

Open AndyTheFactory opened 10 months ago

AndyTheFactory commented 10 months ago

Issue by HodorTheCoder Thu Dec 7 23:10:13 2017 Originally opened as https://github.com/codelucas/newspaper/issues/485


Overview

When I download and parse an article (the example here is from CNN), the parser stops at the "Read More" marker in the HTML and doesn't extract the rest of the body. It also includes the story highlights as part of the text, which I don't think it should do.

I'm fairly certain I'm not doing anything wrong, and this happens in both the Python 2 and Python 3 versions.

I've also tested it on two different systems from different locations and IP addresses: macOS on my laptop and an Ubuntu server. So I'm fairly certain the environment isn't the cause.

How to reproduce:

>>> from newspaper import Article

>>> a = Article("http://www.cnn.com/2017/12/06/politics/al-franken-replacement/index.html")
>>> a.download()
>>> a.parse()

At this point, it prints out the following:

>>> print(a.text)

Story highlights Democratic Gov. Mark Dayton would appoint a replacement if Franken resigns

That would set up a special election in November 2018

(CNN) Should Sen. Al Franken decide to step down, his resignation would set up a gubernatorial appointment and open up a new Senate battleground in 2018.

Minnesota Gov. Mark Dayton does not plan to get ahead of Franken's scheduled announcement Thursday, a senior Minnesota Democrat close to Dayton told CNN, but the governor's "expectation and hope is for Franken to resign."

Should Franken step down, top names to replace him are Democratic Reps. Keith Ellison and Tim Walz, this official said. Another leading contender will be Lt. Gov. Tina Smith, a former chief of staff to Dayton.

"Don't overlook Lt. Governor Smith," the official said. "She could be the perfect choice."

Dayton, a former US senator, might also tap his former colleagues for advice in his pick, including Senate Minority Leader Chuck Schumer. "There will be an open line of communication," said the senior Democratic strategist.

Read More

However, using the demo here:

http://newspaper-demo.herokuapp.com/

If you paste in the article:

http://www.cnn.com/2017/12/06/politics/al-franken-replacement/index.html (here's the activated link: http://newspaper-demo.herokuapp.com/articles/show?url_to_clean=http%3A%2F%2Fwww.cnn.com%2F2017%2F12%2F06%2Fpolitics%2Fal-franken-replacement%2Findex.html )

It parses the article correctly and displays the text I would expect. (I won't paste it here but you can try it yourself.)

Resolution

What am I doing wrong? The fact that this is consistent on two systems, across multiple operating systems and Python versions and from different IPs, suggests that I either have a broken requirement or am doing something incorrectly.

In [9]: try:
   ...:     html_string = ElementTree.tostring(article.clean_top_node)
   ...: except:
   ...:     html_string = "Error converting html to string."
   ...:

In [10]: html_string
Out[10]: 'Error converting html to string.'
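
The bare except in the session above hides the reason the conversion failed. A minimal sketch of the same step that keeps the exception type and message follows. It assumes ElementTree is the standard library's xml.etree.ElementTree (the session does not show the import), and node_to_html is a hypothetical helper name. The nodes used here are stand-ins; in the session the argument was article.clean_top_node.

```python
from xml.etree import ElementTree


def node_to_html(node):
    """Serialize a node, returning (html, error) instead of hiding the failure."""
    try:
        return ElementTree.tostring(node), None
    except Exception as exc:  # still broad, but the cause is reported
        return None, f"{type(exc).__name__}: {exc}"


# Stand-in node built from a fragment of the HTML quoted in this issue.
good = ElementTree.fromstring("<div><p>Minnesota Gov. Mark Dayton</p></div>")
html, error = node_to_html(good)
print(html, error)

# A value that cannot be serialized now yields a message naming the exception.
html, error = node_to_html(None)
print(html, error)
```

If the node turns out to be an lxml element, lxml.etree.tostring is the usual serializer for it, and the error message from this helper should make that kind of mismatch visible.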

EDIT

I downloaded and ran the demo locally on my laptop to see what's happening differently, and I'm still getting the same result as when I do it inside ipython. What's interesting is that the Article HTML shown at the bottom of the official Heroku demo page and the Article HTML produced by my local demo are quite different. Here's what's on mine:

b'<div class="l-container" gravityNodes="15" gravityScore="243.5"><div class="el__leafmedia el__leafmedia--storyhighlights"><div class="el__storyhighlights_wrapper"><div class="el__storyhighlights"><h3 class="el__headline">Story highlights</h3><ul class="el__storyhighlights__list"><li class="el__storyhighlights__item el__storyhighlights--normal">Democratic Gov. Mark Dayton

and on the official demo:

<div gravityNodes="15" gravityScore="240"><p class="zn-body__paragraph speakable">Minnesota Gov. Mark Dayton

So it almost looks like an inconsistency in how the HTML is being parsed. Could that be a tooling issue or a library issue of some kind? I just reverified on OS X that I have all the proper libraries installed and updated, and I uninstalled and reinstalled newspaper3k, with still the same result.

I really need to get this fixed as I'm trying to build a dataset for machine learning that I just realized might not be working properly. Help would be amazing.
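
For a dataset built from many articles, a rough check can flag extractions that show the two symptoms described in this issue. This is only a heuristic sketch: looks_truncated is a hypothetical helper, and the two marker strings are taken from the CNN output quoted above, so they are an assumption that may not hold for other sites.

```python
def looks_truncated(text):
    """Return a list of warnings for an extracted article body (heuristic only)."""
    warnings = []
    stripped = text.strip()
    if stripped.endswith("Read More"):
        warnings.append("ends at 'Read More'")
    if stripped.startswith("Story highlights"):
        warnings.append("starts with 'Story highlights'")
    return warnings


# Shortened copy of the output printed earlier in this issue.
sample = (
    "Story highlights Democratic Gov. Mark Dayton would appoint a replacement "
    "if Franken resigns\n\n"
    '"There will be an open line of communication," said the senior '
    "Democratic strategist.\n\nRead More"
)
print(looks_truncated(sample))
```

An empty list does not prove the body is complete; it only means neither marker was found.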

Thanks! /h

AndyTheFactory commented 10 months ago

Comment by HodorTheCoder Fri Dec 8 15:53:47 2017


UPDATE

So, I re-cloned the demo locally, created a brand new virtual environment for it, and installed all the requirements exactly as they appear in requirements.txt (minus the virtualenv stuff) and, lo and behold, it works exactly like I expect, in line with the live demo.

For reference, the demo is located here:

https://github.com/codelucas/newspaper-demo

and the requirements.txt states:

BeautifulSoup==3.2.1
Flask==0.10.1
Jinja2==2.7.3
MarkupSafe==0.23
Pillow==2.6.1
Werkzeug==0.9.6
argparse==1.2.1
cssselect==0.9.1
gunicorn==19.1.1
itsdangerous==0.24
lxml==3.4.0
newspaper==0.0.8
nltk==3.0.0
requests==2.4.3
six==1.8.0
stevedore==1.0.0
virtualenv==1.11.6
virtualenv-clone==0.2.5
virtualenvwrapper==4.3.1
wsgiref==0.1.2

So there seems to be some library funkiness that breaks when you don't use the EXACT requirements in that demo (which is almost 3 years old and uses the old Python 2 implementation).

For reference, here's the requirements.txt from the latest Python 2 version of newspaper on PyPI, newspaper-0.1.0.7, found here: https://pypi.python.org/pypi/newspaper

beautifulsoup4==4.3.2
Pillow==2.5.1
PyYAML==3.11
cssselect==0.9.1
lxml==3.3.5
nltk==2.0.5
requests==2.6.0
six==1.7.3
jieba==0.35
feedparser==5.1.3
tldextract==1.5.1
feedfinder2==0.0.1
python-dateutil==2.4.0

You'll notice some weirdness. The demo uses an older version of newspaper (0.0.8) but has newer versions of a bunch of libraries, and OLDER versions of other libraries. lxml in the demo is 3.4.0 versus 3.3.5 in the python-2-head release. Pillow is 2.6.1 in the demo while in head it's 2.5.1. nltk is 3.0.0 in the demo, but head uses 2.0.5 (which, I might add, doesn't work anymore when you try to install it). requests goes the other way: 2.4.3 in the demo versus 2.6.0 in head.
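
The comparison above can be done mechanically. The sketch below parses a subset of the two pinned lists quoted in this comment and reports, for each package that appears in both, whether the demo pins a newer, older, or identical version. The function names are hypothetical, and the version comparison assumes purely numeric dotted versions, which holds for these pins but not for version strings in general.

```python
def parse_pins(text):
    """Parse whitespace-separated 'name==version' pins into {lowercase name: version}."""
    pins = {}
    for item in text.split():
        name, _, version = item.partition("==")
        pins[name.lower()] = version
    return pins


def version_key(version):
    """Turn '3.4.0' into (3, 4, 0) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def compare_pins(demo, release):
    """Return (package, demo_version, release_version, relation) for shared packages."""
    rows = []
    for name in sorted(set(demo) & set(release)):
        d, r = version_key(demo[name]), version_key(release[name])
        relation = "same" if d == r else ("demo newer" if d > r else "demo older")
        rows.append((name, demo[name], release[name], relation))
    return rows


# Subsets of the two requirements lists quoted in this comment.
DEMO = parse_pins(
    "Pillow==2.6.1 cssselect==0.9.1 lxml==3.4.0 newspaper==0.0.8 "
    "nltk==3.0.0 requests==2.4.3 six==1.8.0"
)
RELEASE = parse_pins(
    "Pillow==2.5.1 PyYAML==3.11 cssselect==0.9.1 lxml==3.3.5 "
    "nltk==2.0.5 requests==2.6.0 six==1.7.3"
)

for row in compare_pins(DEMO, RELEASE):
    print("%-10s demo=%-6s release=%-6s %s" % row)
```

Packages that appear in only one list (for example BeautifulSoup in the demo versus beautifulsoup4 in the release) are skipped by this comparison and need to be checked by hand.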

I know the Python 2 version is deprecated, but it should still work. Just FYI.

So in essence, the latest version doesn't parse properly on either OS X or Ubuntu with latest libraries as installed when doing a pip install newspaper or pip3 install newspaper3k.
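
To make reports comparable across machines, it helps to record which versions pip actually installed. A minimal sketch using the standard library's importlib.metadata (Python 3.8 or later) follows. The installed_versions helper is hypothetical, and the package list is an assumption drawn from the requirements quoted in this thread.

```python
from importlib import metadata


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Distribution names taken from the requirements lists in this thread.
PACKAGES = ["newspaper3k", "newspaper", "lxml", "beautifulsoup4",
            "nltk", "requests", "Pillow", "six"]

for name, version in installed_versions(PACKAGES).items():
    print(f"{name}=={version}" if version else f"{name}: not installed")
```

Running this in both the failing environment and the working virtual environment gives two lists that can be compared line by line.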

Please let me know if you have any idea what the solution might be. For now, I'll just isolate this in a virtual environment and call it remotely (which is not ideal but will work for the moment.)

I also see a few other comments in the issues from people seeing the same behavior as me, where things don't parse correctly. The virtual environment workaround above should solve your problem for the time being, until we can figure out a fix for this.

Can anybody else recreate these results? (i.e., install the latest version in either Python 2 or 3, download that article, verify that it doesn't parse the body correctly, then recreate the exact library versions in a virtual environment and verify that it DOES work?)

Thank you all.

I'm here to help.

AndyTheFactory commented 10 months ago

Comment by HodorTheCoder Fri Dec 8 16:14:50 2017


UPDATE #2

Installing newspaper==0.0.8 when using Python 2 seems to be the BEST option that I've found. If you use --upgrade, it'll upgrade you to all the latest libraries (the requirements don't list exact versions) and it all works. Just FYI.

(I'm still not thrilled about using such an old version. I'm not sure it has all the best multithreading for massive downloads, etc., but I need the text to extract properly.)

AndyTheFactory commented 10 months ago

Comment by kshitijsachan Mon Apr 20 00:37:10 2020


Do we know if there is a fix for this in Python3? I'm having the same issue when using newspaper3k.

AndyTheFactory commented 10 months ago

Comment by bilaltahirz Sun May 24 21:37:35 2020


This issue is still present in the Python 3 version of this library. I have reproduced it on the CNN website.

AndyTheFactory commented 10 months ago

Comment by Irtza Tue Jun 16 08:38:44 2020


Bump

AndyTheFactory commented 10 months ago

Comment by zomGreg Fri Feb 4 03:14:13 2022


Bump! Any word on this? It still seems to be an issue and I can't figure out a workaround. It seems to happen at random for CNN sites.