codelucas / newspaper

newspaper3k is a news, full-text, and article metadata extraction library in Python 3. Advanced docs:
https://goo.gl/VX41yK
MIT License

article_html does not keep the img tags #41

Open Knights22 opened 10 years ago

Knights22 commented 10 years ago

When extracting the article node as HTML using a.article_html, the <img> tags are not kept. I noticed that in the clean_html(cls, node) function, 'img' is in the allowed tags, so why is it not included in the article_html output?

    article_cleaner.allow_tags = ['a', 'span', 'p', 'br', 'strong', 'b',
            'em', 'i', 'tt', 'code', 'pre', 'blockquote', 'img', 'h1',
            'h2', 'h3', 'h4', 'h5', 'h6']
    article_cleaner.remove_unknown_tags = False
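
For reference, a minimal way to reproduce this. The URL is just an example (the same one used later in this thread), and keep_article_html is my assumption about how article_html is enabled in the config; older versions may not need it:

    import newspaper

    # Example URL only; any article page with inline images should reproduce this.
    url = 'http://www.rockpapershotgun.com/2014/07/24/top-down-tracy-third-eye-crime/'

    # keep_article_html is assumed to be the config flag that populates article_html.
    article = newspaper.Article(url, keep_article_html=True)
    article.download()
    article.parse()

    # 'img' is in allow_tags, yet no <img> survives in the formatted output.
    print('<img' in article.article_html)  # prints False, illustrating the bug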
codelucas commented 10 years ago

This is a known issue that a few others and I have been looking into. I'll update you if I find anything out.

adamlwgriffiths commented 10 years ago

I noticed that Article.top_node and Article.clean_top_node differ slightly. Although clean_top_node is deep copied from top_node, clean_top_node includes img tags.

>>> import newspaper
>>> from lxml import etree as ET
>>> url = 'http://www.rockpapershotgun.com/2014/07/24/top-down-tracy-third-eye-crime/'
>>> n = newspaper.Article(url)
>>> n.download()
>>> n.parse()
>>> 'img' in ET.tostring(n.clean_top_node)
True
>>> 'img' in ET.tostring(n.top_node)
False

article_html is then extracted from Article.top_node. How a deep copy adds tags, I don't know.

codelucas commented 9 years ago

Hey guys! I just had the time to look this over and found the issue.

article_html is generated from the top_node which is the most important part of the DOM after heavy filtering. Unfortunately, sometimes the filtering isn't correct and important things get filtered out.

@adamlwgriffiths The deep copy isn't adding tags. Rather, article_html = output_formatter.get_formatted(self.top_node) actually manipulates the top_node itself (the function is badly named), and in this case the <img> tags were stripped out.
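
As a stopgap until the filtering is fixed, something like this sketch, which relies on the clean_top_node behaviour shown above, should keep the images by serializing clean_top_node instead of relying on article_html:

    from lxml import etree

    import newspaper

    url = 'http://www.rockpapershotgun.com/2014/07/24/top-down-tracy-third-eye-crime/'
    article = newspaper.Article(url)
    article.download()
    article.parse()

    # clean_top_node still contains the <img> tags that get stripped out of
    # top_node during formatting, so serializing it keeps the images.
    html_with_images = etree.tostring(article.clean_top_node, encoding='unicode')
    print('<img' in html_with_images)  # True for this page, per the session above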

codelucas commented 9 years ago

Possible solutions:

- Move the line self.clean_top_node = copy.deepcopy(self.top_node) a few lines upward, to right after the top_node is computed. This still isn't foolproof against random tags being filtered out, because the calculation of the top_node involves having the HTML cleaned first.

- Just use clean_doc, the untouched and uncleaned doc, if you need to search the DOM for anything from the original HTML.
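
For the second option, a rough sketch of what that could look like, assuming clean_doc is exposed on the parsed article as described above:

    import newspaper

    url = 'http://www.rockpapershotgun.com/2014/07/24/top-down-tracy-third-eye-crime/'
    article = newspaper.Article(url)
    article.download()
    article.parse()

    # clean_doc is the untouched document, so it still has every <img>,
    # including ones outside the article body (ads, nav chrome, etc.).
    image_urls = [img.get('src') for img in article.clean_doc.xpath('//img')]
    print(image_urls)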

adamlwgriffiths commented 9 years ago

In the end I inherited from the class and reimplemented that method. I don't think that code is still in use though.

Re: using clean_doc: that works, assuming clean_doc doesn't contain any of the images that would otherwise be stripped (spam, ads, etc.).

Thanks for looking into this, hopefully you can resolve it in a neat way =)
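
For anyone curious, a minimal sketch of that kind of subclass workaround. The class name and the idea of rebuilding article_html from clean_top_node are illustrative, not what newspaper does internally:

    from lxml import etree

    import newspaper


    class ImageKeepingArticle(newspaper.Article):
        """Illustrative workaround: rebuild article_html from clean_top_node
        after parsing so that <img> tags survive."""

        def parse(self):
            super(ImageKeepingArticle, self).parse()
            if self.clean_top_node is not None:
                # clean_top_node still holds the <img> tags that the output
                # formatter strips from top_node, so serialize it instead.
                self.article_html = etree.tostring(
                    self.clean_top_node, encoding='unicode')


    article = ImageKeepingArticle(
        'http://www.rockpapershotgun.com/2014/07/24/top-down-tracy-third-eye-crime/')
    article.download()
    article.parse()
    print('<img' in article.article_html)  # True with this workaround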


woozyking commented 9 years ago

I think properly resolving this would also allow for things like article_images, which could be seen as the subset of images that are actually relevant in the context of the article body.