codelucas / newspaper

newspaper3k is a news, full-text, and article metadata extraction library in Python 3. Advanced docs:
https://goo.gl/VX41yK
MIT License

Retain HTML markup for extracted article #18

Closed WheresWardy closed 10 years ago

WheresWardy commented 10 years ago

I currently use Boilerpipe to do article extraction in order to generate Kindle MOBI files to send to my Kindle. I'm wondering if I can feature-request the ability to do something similar in Newspaper: have the article text extraction retain a minimal set of markup around the text, enough to give it structure as far as HTML is concerned. This makes onward conversion to other formats a lot easier, and makes it possible to retain certain markup that can only be expressed using HTML (such as images in situ and code fragments).

codelucas commented 10 years ago

Good idea. The feature hasn't been added to the documentation yet, but a few days ago someone committed and pushed an update to newspaper that does something similar to what you're describing.

See this pull request: https://github.com/codelucas/newspaper/pull/11

What this does: every article now has a variable called article_html which contains the HTML markup plus the text of the main body. Be sure to set the keep_article_html value in the config object to True.
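
For illustration, a minimal sketch of the intended usage based on that description (the URL is a placeholder, and the keyword-argument form for passing the config is an assumption that may differ between versions):

from newspaper import Article, Config

config = Config()
config.keep_article_html = True  # retain HTML markup for the extracted body

article = Article('http://example.com/some-article', config=config)
article.download()
article.parse()

# article_html holds the markup plus text of the main body
print(article.article_html)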

Is this similar enough to what you want? Or did you actually mean generating and retaining markdown from the text?

Thanks.

WheresWardy commented 10 years ago

Hi @codelucas, thanks for your quick reply!

This seems to be exactly what I'm looking for, but I'm having trouble getting it to work with the latest pip version of newspaper (0.0.6).

Example code

from newspaper import Article, Config
c = Config()
c.keep_article_html = True
a = Article('http://arstechnica.com/information-technology/2014/01/quarkxpress-the-demise-of-a-design-desk-darling/', c)
a.download()
a.parse()

print len(a.text)
print len(a.article_html)

Output

8325
0

As you can see, I don't seem to get any article_html output from newspaper, even though the text value seems to indicate that it downloaded and processed the article URL correctly?

codelucas commented 10 years ago

I just mimicked your code with one tweak and it worked. The issue was how you were passing in the config: you should use a named argument for it, because the second positional argument is actually something else, not the config. However, argument passing has been simplified in newspaper 0.0.6 anyway, so you can pass in config arguments directly! I'll show both examples below.

Example 1:

from newspaper import Article

a = Article('http://arstechnica.com/information-technology/2014/01/quarkxpress-the-demise-of-a-design-desk-darling/', keep_article_html=True)

a.download()
a.parse()

print len(a.article_html)

Example 2:

from newspaper import Article, Config

c = Config()
c.keep_article_html = True

a = Article('http://arstechnica.com/information-technology/2014/01/quarkxpress-the-demise-of-a-design-desk-darling/', config=c)  # NAMED ARG

a.download()
a.parse()

print len(a.article_html)

Config objects are kind of a hassle; it's easier to just pass in raw arguments, so we made it like that! No worries though, changing our API without warning was our fault! Good luck!

codelucas commented 10 years ago

Update me if this works so I can close the issue! Thanks.

WheresWardy commented 10 years ago

Yup, that seems to have done the trick! I just have one more quick (hopefully related) question: I want to expand the supported HTML tags to include more markup, so I've edited line 166 in newspaper/parsers.py to include the extra tags I want (for example, I've added the h* tags, img, code and pre), but it doesn't seem to be taking effect. Is this the only place that needs editing to support additional tags?
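
For reference, a hedged sketch of roughly the kind of whitelist change being described here, using lxml.html.clean.Cleaner; the tag list and variable names are illustrative and may not match what newspaper/parsers.py actually contains:

import lxml.html.clean

# Illustrative whitelist: keep headings, images, and code blocks in the
# cleaned article HTML in addition to the basic text tags.
allowed_tags = [
    'a', 'span', 'p', 'br', 'strong', 'b', 'em', 'i',
    'h1', 'h2', 'h3', 'h4', 'h5', 'h6',  # headings
    'img', 'code', 'pre',                # images and code fragments
]

article_cleaner = lxml.html.clean.Cleaner()
article_cleaner.allow_tags = allowed_tags
article_cleaner.remove_unknown_tags = False  # required when allow_tags is used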

codelucas commented 10 years ago

Submit a pull request so everyone can see what you are doing.

Thanks

WheresWardy commented 10 years ago

No problem, it's in #21

codelucas commented 10 years ago

Your code looks like it should work, and you're right: that is the only location where any change is required.

Hmm, this issue can come from lxml.html.clean, which is our cleaner. Refer to http://lxml.de/api/lxml.html.clean.Cleaner-class.html

Can you post a failure example case where the cleaner fails for your new tags?

WheresWardy commented 10 years ago

The article/code I posted above (the one that you corrected the config for) is a good example - after the first paragraph, there's an h2 element that gets stripped (as do all the paragraph titles in that article):

<h2>The rise</h2>

There's also quite a few images on that page, and they all get stripped.

WheresWardy commented 10 years ago

This should work as an example failure case (currently it outputs 0, but should hopefully output more than 0 if it's working)

import re
from newspaper import Article

a = Article('http://arstechnica.com/information-technology/2014/01/quarkxpress-the-demise-of-a-design-desk-darling/', keep_article_html=True)
a.download()
a.parse()

m = re.findall('<h2', a.article_html, flags=re.I)
print len(m)

codelucas commented 10 years ago

Sorry for the late response, school is picking up.

I found the error. Refer to this line of code: https://github.com/codelucas/newspaper/blob/master/newspaper/article.py#L194

So pretty much after the cleanup of the DOM, newspaper scans the DOM again and removes any tags that "looks like non-content, clusters of links, or paras with no gusto". Gusto is a measure used by the body text extraction algorithm.

If you comment that line out, your len(m) value suddenly becomes 3, which is the correct value.

The tradeoff of leaving that line commented out is that you will probably receive more unwanted body text, in exchange for this article_html feature working better. Whether or not we should do that is up for debate. I will merge your pull request, but I won't comment out that post_cleanup line in any production code until we make a formal decision.

Thanks!

WheresWardy commented 10 years ago

No problem at all, you've been more than speedy with your responses to what is essentially a pretty tiny (and already existing) feature request from me!

That's perfect for now - I can always do any further cleanup of the resulting article_html myself once that line has been commented out. Thanks for all your help.
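
As a hedged sketch of that kind of follow-up cleanup, here is one way to post-process the returned article_html with lxml; the helper name and the tags being removed are just examples, not part of newspaper:

import lxml.html

def strip_unwanted(article_html, unwanted_tags=('aside', 'figure')):
    # Parse the article_html string returned by newspaper and drop any
    # elements you still don't want (the tag names here are placeholders).
    root = lxml.html.fromstring(article_html)
    for tag in unwanted_tags:
        for node in root.findall('.//' + tag):
            node.drop_tree()
    return lxml.html.tostring(root, encoding='unicode')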

phoenixwizard commented 10 years ago

If I set keep_article_html=True in the config, I get the following error when running parse():

File "<stdin>", line 1, in <module>
File "/Users/aram/code/microblog/flask/lib/python2.7/site-packages/newspaper/article.py", line 221, in parse
  self.top_node)
File "/Users/aram/code/microblog/flask/lib/python2.7/site-packages/newspaper/outputformatters.py", line 47, in get_formatted
  html = self.convert_to_html()
File "/Users/aram/code/microblog/flask/lib/python2.7/site-packages/newspaper/outputformatters.py", line 68, in convert_to_html
  cleaned_node = self.parser.clean_article_html(self.get_top_node())
File "/Users/aram/code/microblog/flask/lib/python2.7/site-packages/newspaper/parsers.py", line 62, in clean_article_html
  article_cleaner = lxml.html.clean.Cleaner()
AttributeError: 'module' object has no attribute 'clean'
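
For context on that traceback: a likely cause is that lxml's clean module is a submodule that only becomes available after an explicit import; importing lxml.html alone does not expose lxml.html.clean. A hedged sketch of the import that avoids the AttributeError:

# Importing the submodule explicitly makes lxml.html.clean available.
import lxml.html.clean  # or: from lxml.html.clean import Cleaner

article_cleaner = lxml.html.clean.Cleaner()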