Open Knights22 opened 10 years ago
This is a known issue that a few others and I have been looking over. I'll update you if I find out anything.
I noticed that Article.top_node and Article.clean_top_node differ slightly. Although clean_top_node is deep copied from top_node, clean_top_node includes img tags.
>>> import newspaper
>>> from lxml import etree as ET  # assumed: the original snippet omits this import
>>> url = 'http://www.rockpapershotgun.com/2014/07/24/top-down-tracy-third-eye-crime/'
>>> n = newspaper.Article(url)
>>> n.download()
>>> n.parse()
>>> 'img' in ET.tostring(n.clean_top_node)
True
>>> 'img' in ET.tostring(n.top_node)
False
article_html is then extracted from Article.top_node. How a deep copy adds tags, I don't know.
Hey guys! I just had the time to look this over and found the issue.
article_html is generated from the top_node, which is the most important part of the DOM after heavy filtering. Unfortunately, sometimes the filtering isn't correct and important things get filtered out.
@adamlwgriffiths The deep copy isn't adding tags. Rather, the line text, article_html = output_formatter.get_formatted(self.top_node) actually manipulates the top_node (the function is badly named). In this case, the <img> was stripped out.
Possible solutions:
- Move the line self.clean_top_node = copy.deepcopy(self.top_node) a few lines upward, to right after the top_node is computed. This still isn't foolproof against random tags being filtered out, because the calculation of the top_node involves having the HTML be cleaned first.
- Just use clean_doc, the untouched and uncleaned doc, if you need to search the DOM for anything from the original HTML.
In the end I ended up inheriting from the class and reimplementing that method. I don't think that code is still in use though.
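For anyone following along, here is a minimal sketch of why the order of the deep copy matters. It uses the standard library's xml.etree.ElementTree instead of newspaper's own parser, and strip_images_in_place is a hypothetical stand-in for a formatter that mutates the node it is given. It is not newspaper's actual get_formatted.

```python
import copy
import xml.etree.ElementTree as ET


def strip_images_in_place(node):
    """Hypothetical stand-in for a formatter that mutates its argument."""
    # Snapshot the tree first so removing children does not disturb iteration.
    for parent in list(node.iter()):
        for child in list(parent):
            if child.tag == "img":
                parent.remove(child)
    return "".join(node.itertext())


top_node = ET.fromstring(
    "<div><p>Intro text</p><img src='a.png'/><p>More text</p></div>"
)

# Copy first, then format: the copy is independent of later mutation.
clean_top_node = copy.deepcopy(top_node)
text = strip_images_in_place(top_node)

has_img_in_copy = clean_top_node.find(".//img") is not None
has_img_in_original = top_node.find(".//img") is not None
print(has_img_in_copy, has_img_in_original)
```

The copy taken before the call still has its <img>, while the node handed to the formatter does not, which matches the True/False pair reported earlier in the thread.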
Re: using clean_doc. Assuming clean_doc doesn't contain any stripped images (spam, ads, etc).
Thanks for looking into this, hopefully you can resolve it in a neat way =)
On Sat, Feb 7, 2015 at 12:01 AM, Lucas Ou-Yang notifications@github.com wrote:
Possible solutions:
- Move the line self.clean_top_node = copy.deepcopy(self.top_node) a few lines upward to right after the top_node is computed. This still isn't foolproof against random tags being filtered out, because the calculation of the top_node involves having the HTML be cleaned first.
- Just use clean_doc, the untouched and uncleaned doc, if you need to search the DOM for anything from the original HTML.
— Reply to this email directly or view it on GitHub https://github.com/codelucas/newspaper/issues/41#issuecomment-73232918.
I think properly resolving this would also allow things like article_images, which can be seen as a subset of images that only attempts to grab the more relevant images in the context of the actual article body.
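A rough sketch of that idea, assuming the article body node still contains its <img> tags. It uses the standard library's xml.etree.ElementTree rather than newspaper's parser. The image_sources helper, the sample markup, and the article_images name are illustrative only, not part of newspaper's API.

```python
import xml.etree.ElementTree as ET


def image_sources(node):
    """Return the src of every <img> under node, in document order."""
    return [img.get("src") for img in node.iter("img") if img.get("src")]


page = ET.fromstring(
    "<html><body>"
    "<div id='sidebar'><img src='ad.png'/></div>"
    "<div id='article'><p>Body</p><img src='figure.png'/></div>"
    "</body></html>"
)

# Hypothetical: pretend the extractor chose this element as the article body.
article_node = page.find(".//div[@id='article']")

all_images = image_sources(page)              # every image on the page
article_images = image_sources(article_node)  # only images inside the body
print(all_images, article_images)
```

Because the second list is collected from a subtree of the first, it is a subset of the page-wide images by construction.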
When extracting the article node with the HTML using a.article_html, the <img> tags are not kept. I noticed that in the clean_html(cls,node) function, 'img' is allowed, so why is it not included in the article_html output?