Extracting text and author from a heise.de article

codelucas / newspaper

newspaper3k is a library for news, full-text, and article metadata extraction in Python 3. Advanced docs:

https://goo.gl/VX41yK

MIT License


Extracting text and author from a heise.de article #314

Open gsauthof opened 7 years ago

gsauthof commented 7 years ago

Example article: https://www.heise.de/newsticker/meldung/Facebook-Sicherheitscheck-verbreitet-falschen-Alarm-3582470.html

Standard session:

import newspaper

a = newspaper.Article('https://www.heise.de/newsticker/meldung/Facebook-Sicherheitscheck-verbreitet-falschen-Alarm-3582470.html')
a.download()  # fetch the article HTML
a.parse()     # populate a.text, a.authors, etc.

Actual behaviour:

a.text is an empty string, and a.authors contains two elements: 'heise' and 'André Kramer'.

Expected behaviour:

a.text contains the complete article text and a.authors contains just the one element 'André Kramer'.

Reproducible: always

Version: from PyPI, installed today

thundergolfer commented 7 years ago

The author issue is because of the following tag: <meta name="DC.creator" content="heise online">

Newspaper's extractor checks for dc.creator in its get_authors() method, because that tag is used for author names.

Now the extractor also (correctly) identified this tag <meta name="author" content="André Kramer" />.

Given that it's found two different types of author tags with differing content, it might be reasonable to infer that one is incorrect. This package does not do so though, and so both are returned.
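As a rough illustration of the inference described above, here is a minimal sketch that prefers the explicit author tag when both tags are present. It uses only the standard library and is not Newspaper's actual implementation; AuthorMetaParser, pick_authors and the "prefer name=author" rule are hypothetical names and assumptions made for this example.

```python
from html.parser import HTMLParser


class AuthorMetaParser(HTMLParser):
    """Collect author candidates from <meta> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.candidates = {}  # lower-cased meta name -> content

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = (attr.get("name") or "").lower()
        if name in ("author", "dc.creator") and attr.get("content"):
            self.candidates[name] = attr["content"]


def pick_authors(html):
    """Assumed heuristic: trust name="author" over DC.creator if both exist."""
    parser = AuthorMetaParser()
    parser.feed(html)
    found = parser.candidates
    if "author" in found:
        return [found["author"]]
    return list(found.values())


# The two tags quoted in this thread.
html = ('<meta name="DC.creator" content="heise online">'
        '<meta name="author" content="André Kramer" />')
authors = pick_authors(html)
```

Whether DC.creator should always lose is a judgment call; on other sites it may be the only tag that carries the real author.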

thundergolfer commented 7 years ago

I have looked into the a.text problem and found that it is certainly an lxml.html parser problem. Specifically, lxml.html.fromstring(doc) is called, which returns an object whose .text attribute is an empty string.

~~I have not yet found why the lxml parser failed on this particular article, but as it stands this is not a problem of the Newspaper package.~~

Actually it looks like it might be a problem with this line in article.py:

self.extractor.calculate_best_node(self.doc)

The lxml parser itself returns the text content in doc.text_content(), as it should, so the parsing step does not appear to be at fault.
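The difference between .text and the full text content can be illustrated with a small sketch. It uses the standard library's xml.etree.ElementTree rather than lxml, on the assumption that the behaviour carries over: lxml follows the ElementTree API, where .text holds only the text directly before an element's first child, and lxml.html's text_content() plays roughly the role of joining itertext() here. The sample document is made up for the example.

```python
import xml.etree.ElementTree as ET

# Made-up document: all of the text lives in nested elements.
doc = "<html><body><p>Hello <b>world</b></p></body></html>"
root = ET.fromstring(doc)

# .text covers only the text directly before the first child element,
# so on the root it carries none of the article text.
root_text = root.text

# Walking every descendant collects the complete text.
full_text = "".join(root.itertext())

print(repr(root_text))
print(repr(full_text))
```

An empty or missing .text on the root element is therefore expected and does not by itself indicate a parser failure.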