Open gsauthof opened 7 years ago
The author issue is caused by the following tag: <meta name="DC.creator" content="heise online">
Newspaper's extractor's get_authors() method checks for dc.creator, as this is used for authors.
The extractor also (correctly) identified this tag: <meta name="author" content="André Kramer" />
Given that it's found two different types of author tags with differing content, it might be reasonable to infer that one is incorrect. This package does not do so though, and so both are returned.
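To illustrate the two tags colliding, here is a minimal sketch using only the standard library's html.parser. It is not Newspaper's implementation: the names AuthorMetaParser, collect_author_tags and prefer_author_tag are hypothetical, and the rule "prefer name="author" over DC.creator" is just one possible heuristic, assumed here for illustration.

```python
from html.parser import HTMLParser


class AuthorMetaParser(HTMLParser):
    """Collects <meta> tags whose name is 'author' or 'dc.creator' (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.found = []  # list of (lowercased name, content) in document order

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        # Compare names case-insensitively, so "DC.creator" matches "dc.creator".
        name = (attr.get("name") or "").lower()
        if name in ("author", "dc.creator") and attr.get("content"):
            self.found.append((name, attr["content"]))


def collect_author_tags(html):
    parser = AuthorMetaParser()
    parser.feed(html)
    return parser.found


def prefer_author_tag(found):
    # Assumed heuristic: trust name="author" when present, else fall back to dc.creator.
    authors = [content for name, content in found if name == "author"]
    return authors or [content for name, content in found if name == "dc.creator"]


sample = """<html><head>
<meta name="DC.creator" content="heise online">
<meta name="author" content="André Kramer" />
</head><body></body></html>"""

found = collect_author_tags(sample)
preferred = prefer_author_tag(found)
print(found)
print(preferred)
```

With a rule like this, the publisher name in DC.creator would only be used when no name="author" tag exists. Whether that is the right default for every site is an open question.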
I have looked into the a.text problem and found that it is certainly an lxml.html parser problem. Specifically, lxml.html.fromstring(doc) is called, which returns an object whose .text attribute is an empty string.
I have not yet found why the lxml parser failed on this particular article, but as it stands this is not a problem of the Newspaper package.
Actually it looks like it might be a problem with this line in article.py:
self.extractor.calculate_best_node(self.doc)
because the lxml parser is returning the text content in doc.text_content() as it should.
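One possible reason a parsed element's .text can look empty while the full text content is populated: in the ElementTree-style API, .text only holds the text that appears before an element's first child. The sketch below uses the standard library's xml.etree.ElementTree as a stand-in (an assumption made to keep the example dependency-free), with itertext() playing the role that text_content() plays in lxml.html. The sample markup is made up and is not the heise article.

```python
import xml.etree.ElementTree as ET

# Made-up document: the root element has child elements but no text of its own.
doc = ET.fromstring(
    "<html><body><p>First paragraph.</p><p>Second paragraph.</p></body></html>"
)

# .text covers only the text directly before the first child, so it is empty here.
direct_text = doc.text

# Walking all descendants recovers the text held by the nested <p> elements.
full_text = "".join(doc.itertext())

print(repr(direct_text))
print(repr(full_text))
```

If this is what is happening, an empty .text on the root would be expected behaviour rather than a parser failure, which would fit with the text being lost later, in calculate_best_node.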
Example article: https://www.heise.de/newsticker/meldung/Facebook-Sicherheitscheck-verbreitet-falschen-Alarm-3582470.html
Standard session:
Actual behaviour: a.text is a string of length 0 and a.authors contains 2 elements, i.e. 'heise' and 'André Kramer'.
Expected behaviour: a.text contains the complete article text and a.authors contains just the one element 'André Kramer'.
Reproducible: always
Version: from PyPI, installed today