AndyTheFactory / newspaper4k

📰 Newspaper4k a fork of the beloved Newspaper3k. Extraction of articles, titles, and metadata from news websites.
MIT License

article_html does not keep the img tags #5

Open AndyTheFactory opened 12 months ago

AndyTheFactory commented 12 months ago

Issue by Knights22 Sun Mar 2 12:50:56 2014 Originally opened as https://github.com/codelucas/newspaper/issues/41


When extracting the article node with its HTML using a.article_html, the <img> tags are not kept. I noticed that 'img' is allowed in the clean_html(cls, node) function, so why is it not included in the article_html output?

    article_cleaner.allow_tags = ['a', 'span', 'p', 'br', 'strong', 'b',
            'em', 'i', 'tt', 'code', 'pre', 'blockquote', 'img', 'h1',
            'h2', 'h3', 'h4', 'h5', 'h6']
    article_cleaner.remove_unknown_tags = False

AndyTheFactory commented 12 months ago

Comment by codelucas Mon Mar 3 20:51:19 2014


This is a known issue that a few others and I have been looking into. I'll update you if I find anything.

AndyTheFactory commented 12 months ago

Comment by adamlwgriffiths Fri Jul 25 02:00:23 2014


I noticed that Article.top_node and Article.clean_top_node differ slightly. Although clean_top_node is deep copied from top_node, clean_top_node includes img tags.

>>> import newspaper
>>> from lxml import etree as ET  # newspaper's nodes are lxml elements
>>> url = 'http://www.rockpapershotgun.com/2014/07/24/top-down-tracy-third-eye-crime/'
>>> n = newspaper.Article(url)
>>> n.download()
>>> n.parse()
>>> b'img' in ET.tostring(n.clean_top_node)  # tostring returns bytes by default
True
>>> b'img' in ET.tostring(n.top_node)
False

article_html is then extracted from Article.top_node. How a deep copy adds tags, I don't know.

AndyTheFactory commented 12 months ago

Comment by codelucas Fri Feb 6 12:58:36 2015


Hey guys! I just had the time to look this over and found the issue.

article_html is generated from the top_node which is the most important part of the DOM after heavy filtering. Unfortunately, sometimes the filtering isn't correct and important things get filtered out.

@adamlwgriffiths The deep copy isn't adding tags. Rather, the call text, article_html = output_formatter.get_formatted(self.top_node) actually mutates top_node (the function is badly named). In this case, the <img> was stripped out of top_node after clean_top_node had already been copied from it.
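
To illustrate the behavior described above, here is a minimal sketch using only the standard library. It is not newspaper's code: format_in_place is a hypothetical stand-in for a formatter that mutates the node it is given, and the sample markup is made up. It only shows why a deep copy taken before such a call can still contain <img> while the original node no longer does.

```python
import copy
import xml.etree.ElementTree as ET


def format_in_place(node):
    """Hypothetical stand-in for a formatter that mutates its argument.

    Removes every <img> descendant from `node` and returns the remaining text.
    """
    for parent in list(node.iter()):
        for child in list(parent):
            if child.tag == "img":
                parent.remove(child)
    return "".join(node.itertext()).strip()


# Made-up markup standing in for the extracted article node.
top_node = ET.fromstring('<div><p>Some text</p><img src="a.png" /></div>')

clean_top_node = copy.deepcopy(top_node)  # snapshot taken before formatting
text = format_in_place(top_node)          # mutates top_node in place

has_img_in_copy = clean_top_node.find(".//img") is not None
has_img_in_top = top_node.find(".//img") is not None
```

In this sketch the copy keeps its <img> and the original loses it, which matches the True/False pair reported earlier in the thread.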

AndyTheFactory commented 12 months ago

Comment by codelucas Fri Feb 6 13:01:07 2015


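Possible solutions:

- Move the line self.clean_top_node = copy.deepcopy(self.top_node) a few lines upward, to right after the top_node is computed. This still isn't foolproof against random tags being filtered out, because the calculation of the top_node involves cleaning the HTML first.

- Just use clean_doc, the untouched and uncleaned doc, if you need to search the DOM for anything from the original HTML.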

AndyTheFactory commented 12 months ago

Comment by adamlwgriffiths Sun Feb 8 11:45:30 2015


In the end I inherited from the class and reimplemented that method. I don't think that code is still in use, though.

Re: using clean_doc: that assumes clean_doc doesn't contain any of the images that were stripped on purpose (spam, ads, etc.).

Thanks for looking into this, hopefully you can resolve it in a neat way =)

On Sat, Feb 7, 2015 at 12:01 AM, Lucas Ou-Yang notifications@github.com wrote:

Possible solutions:

- Move the line self.clean_top_node = copy.deepcopy(self.top_node) a few lines upward, to right after the top_node is computed. This still isn't foolproof against random tags being filtered out, because the calculation of the top_node involves cleaning the HTML first.

- Just use clean_doc, the untouched and uncleaned doc, if you need to search the DOM for anything from the original HTML.


AndyTheFactory commented 12 months ago

Comment by woozyking Fri May 29 05:40:59 2015


I think properly resolving this would also allow things like article_images, which could be a subset of images that only tries to grab the more relevant images in the context of the actual article body.
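
A rough sketch of what such an article_images helper might look like, assuming the <img> tags survive in the article node. The function name follows the suggestion above but is hypothetical, as is the sample markup. The standard library's ElementTree is used here instead of newspaper's own parser, to keep the example self-contained.

```python
import xml.etree.ElementTree as ET


def article_images(article_node):
    """Hypothetical helper: src of every <img> under article_node, in document order."""
    return [img.get("src") for img in article_node.iter("img") if img.get("src")]


# Made-up page with one image outside the article body and one inside it.
page = ET.fromstring(
    "<html><body>"
    '<div id="sidebar"><img src="ad.png" /></div>'
    '<div id="article"><p>Body text</p><img src="figure.png" /></div>'
    "</body></html>"
)

article_node = page.find(".//div[@id='article']")

all_images = [img.get("src") for img in page.iter("img")]
relevant_images = article_images(article_node)
```

Here all_images covers the whole page, while relevant_images is limited to what sits inside the article node, so it only works if the formatter has not already stripped the tags.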