Closed: WheresWardy closed this issue 10 years ago.
I currently use Boilerpipe to do article extraction in order to generate Kindle MOBI files to send to my Kindle. I'm wondering if it's possible to feature-request the ability to do something similar in Newspaper: have the article text extraction retain a minimal set of markup around the text, enough to give it structure as far as HTML is concerned. This makes forward conversion to other formats a lot easier, and allows retaining certain markup that can only be expressed using HTML (such as images in situ and code fragments).
Good idea. The feature isn't in the documentation yet, but a few days ago someone committed an update to newspaper that does something similar to what you're describing.
See this pull request: https://github.com/codelucas/newspaper/pull/11
What this does is: every article now has a variable called article_html, which contains the HTML markup plus the text of the main body. Be sure to set the keep_article_html value in the config object to True.
Is this similar enough to what you want? Or did you actually mean generating and retaining Markdown from the text?
Thanks.
Hi @codelucas, thanks for your quick reply!
This seems to be exactly what I'm looking for, but I'm having trouble getting it to work with the latest PIP version of newspaper (0.0.6).
Example code
from newspaper import Article, Config
c = Config()
c.keep_article_html = True
a = Article('http://arstechnica.com/information-technology/2014/01/quarkxpress-the-demise-of-a-design-desk-darling/', c)
a.download()
a.parse()
print len(a.text)
print len(a.article_html)
Output
8325
0
As you can see, I don't seem to get any article_html output from newspaper, even though the text value seems to indicate that it downloaded and processed the article URL correctly?
I just mimicked your code with one tweak and it worked. The problem is how you're passing in the config: you should use a named argument, because the second positional argument to Article is actually something else, not the config. That said, argument passing has been simplified in newspaper 0.0.6 anyway, so you can pass config arguments in directly. I'll show both examples below.
Example 1:
from newspaper import Article
a = Article('http://arstechnica.com/information-technology/2014/01/quarkxpress-the-demise-of-a-design-desk-darling/', keep_article_html=True)
a.download()
a.parse()
print len(a.article_html)
Example 2:
from newspaper import Article, Config
c = Config()
c.keep_article_html = True
a = Article('http://arstechnica.com/information-technology/2014/01/quarkxpress-the-demise-of-a-design-desk-darling/', config=c)  # NAMED ARG
a.download()
a.parse()
print len(a.article_html)
Config objects are kind of a hassle; it's easier to just pass in raw arguments, so we made it work like that! No worries though, changing our API without warning was our fault! Good luck!
Update me if this works so I can close the issue! Thanks.
Yup, that seems to have done the trick! I just have one more quick (hopefully related) question: I want to expand the supported HTML tags to include more markup, so I've edited line 166 in newspaper/parsers.py to include the extra tags I want, but it doesn't seem to be taking effect (for example, I've added the h* tags, img, code and pre) - is this the only place that needs editing to support additional tags?
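For context, here is a rough sketch of the kind of edit being described, assuming the cleaner in parsers.py wraps lxml.html.clean.Cleaner with an allow_tags whitelist (a hypothetical excerpt, not the file's exact contents):
import lxml.html.clean

def clean_article_html(node):
    # Whitelist-based cleaner: any tag not listed in allow_tags is stripped
    # from article_html (its text content is kept).
    article_cleaner = lxml.html.clean.Cleaner()
    article_cleaner.javascript = True
    article_cleaner.style = True
    article_cleaner.allow_tags = [
        'a', 'span', 'p', 'br', 'strong', 'b', 'em', 'i',
        # extra tags added to keep richer markup:
        'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'img', 'code', 'pre']
    # required when allow_tags is set, otherwise lxml raises a ValueError
    article_cleaner.remove_unknown_tags = False
    return article_cleaner.clean_html(node)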
Submit a pull request so everyone can see what you are doing.
Thanks
No problem, it's in #21
Your code looks like it should work. That is the only location where any change is required, you are right.
Hmm, this issue can come from lxml.html.clean, which is our cleaner.
Refer to http://lxml.de/api/lxml.html.clean.Cleaner-class.html
Can you post an example case where the cleaner fails for your new tags?
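To make the stripping behaviour concrete, here is a minimal, self-contained lxml example (invented HTML, not taken from the article in question) showing how a Cleaner whitelist drops tags that are not allowed:
import lxml.html.clean

cleaner = lxml.html.clean.Cleaner()
cleaner.remove_unknown_tags = False
cleaner.allow_tags = ['div', 'p']  # h2 and img are deliberately not allowed
html = '<div><h2>The rise</h2><p>Body text.</p><img src="x.png"/></div>'
print(cleaner.clean_html(html))
# The h2 and img tags are stripped: the heading text survives as plain text,
# the image disappears entirely.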
The article/code I posted above (the one that you corrected the config for) is a good example - after the first paragraph, there's an h2 element that gets stripped (as do all the paragraph titles in that article):
<h2>The rise</h2>
There are also quite a few images on that page, and they all get stripped.
This should work as an example failure case (currently it outputs 0, but should hopefully output more than 0 if it's working)
import re
from newspaper import Article
a = Article('http://arstechnica.com/information-technology/2014/01/quarkxpress-the-demise-of-a-design-desk-darling/', keep_article_html=True)
a.download()
a.parse()
m = re.findall('<h2', a.article_html, flags=re.I)
print len(m)
Sorry for the late response, school is picking up.
I found the error. Refer to this line of code: https://github.com/codelucas/newspaper/blob/master/newspaper/article.py#L194
So pretty much: after the initial cleanup of the DOM, newspaper scans the DOM again and removes any tags that "look like non-content, clusters of links, or paras with no gusto". Gusto is a measurement used by the body text extraction algorithm.
If you comment that line out, your len(m) value suddenly becomes 3, which is the correct value.
The trade-off is: leave that line commented out and you will probably receive more unwanted body text; leave it in and you miss out on perfecting this article_html feature. Whether or not we should do that is up for debate. I will merge your pull request, but I won't comment out that post_cleanup line in any production code until we make a formal decision.
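Purely as an illustration of the idea (this is not newspaper's actual code, and is_high_link_density and stopword_count are made-up helper names), that post-cleanup pass amounts to something like:
def post_cleanup_sketch(top_node, is_high_link_density, stopword_count):
    # Walk the extracted top node and drop elements that look like
    # non-content: link clusters, or paragraphs with no "gusto"
    # (no stopword-bearing text).
    for node in list(top_node.iter('p', 'h2', 'img')):
        if is_high_link_density(node) or stopword_count(node) == 0:
            node.getparent().remove(node)
    return top_node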
Thanks!
No problem at all, you've been more than speedy with your responses to what is essentially a pretty tiny (and already existing) feature request from me!
That's perfect for now - I can always do any remaining cleanup of the resulting article_html myself once that line has been commented out. Thanks for all your help.
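For what it's worth, that kind of user-side cleanup of article_html can be done with lxml directly; a minimal sketch (the unwanted class names are invented for illustration):
import lxml.html

def strip_unwanted(article_html, unwanted_classes=('sidebar', 'related')):
    # Parse the article_html string, drop descendant elements whose class
    # attribute looks unwanted, and serialize back to a string.
    root = lxml.html.fromstring(article_html)
    bad = [el for el in root.iterdescendants()
           if any(c in (el.get('class') or '') for c in unwanted_classes)]
    for el in bad:
        el.drop_tree()
    return lxml.html.tostring(root, encoding='unicode')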
If I keep keep_article_html=True in the config, I get the following error when running parse:
File "<stdin>", line 1, in <module>
File "/Users/aram/code/microblog/flask/lib/python2.7/site-packages/newspaper/article.py", line 221, in parse
self.top_node)
File "/Users/aram/code/microblog/flask/lib/python2.7/site-packages/newspaper/outputformatters.py", line 47, in get_formatted
html = self.convert_to_html()
File "/Users/aram/code/microblog/flask/lib/python2.7/site-packages/newspaper/outputformatters.py", line 68, in convert_to_html
cleaned_node = self.parser.clean_article_html(self.get_top_node())
File "/Users/aram/code/microblog/flask/lib/python2.7/site-packages/newspaper/parsers.py", line 62, in clean_article_html
article_cleaner = lxml.html.clean.Cleaner()
AttributeError: 'module' object has no attribute 'clean'
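That traceback looks consistent with lxml's clean submodule not being loaded at the point where the Cleaner is constructed: importing lxml.html on its own does not necessarily make lxml.html.clean available as an attribute. A minimal sketch of the distinction (an assumed explanation, not a confirmed fix for this report):
import lxml.html
# lxml.html.clean.Cleaner() here may raise
# AttributeError: 'module' object has no attribute 'clean'
# if nothing has imported the clean submodule yet.

import lxml.html.clean  # import the submodule explicitly
cleaner = lxml.html.clean.Cleaner()  # now resolves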