Open · AndyTheFactory opened this issue 1 year ago
Comment by ashkaushik Thu May 7 17:35:28 2020
@ArielFrischer any luck fixing these issues? I'm stuck with the same problem. Do you have a solution?
Comment by ariel-frischer Fri May 8 02:01:24 2020
@ashkaushik I don't have the experience or the time to delve into how the node structure works for this package. I would honestly pay someone with expertise on this library to fix these issues. I just wish this project were better maintained; there have been no updates in a while...
Comment by kmgreen2 Fri Sep 18 16:15:01 2020
@ariel-frischer @ashkaushik Not sure if this helps, but this appears to be an issue with the logic in ContentExtractor. I figured the problem was in either the downloader or the parser, so I verified that the downloaded HTML parses fine using a simple bs4 parser that extracts the paragraph tags. After some debugging, it looks like calculate_best_node tries to extract the most "meaningful" subtree of text blocks, and may filter some of them out.
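For reference, a minimal version of that kind of bs4 sanity check might look like this (a sketch; the URL is the NYT article I use for testing below, and the User-Agent header is just to avoid a bot block):

import requests
from bs4 import BeautifulSoup

url = 'https://www.nytimes.com/2017/02/23/us/politics/cpac-stephen-bannon-reince-priebus.html'
html = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}).text

# Pull every <p> tag straight out of the raw HTML. If the missing
# paragraphs show up here, the downloader is fine and the problem
# is somewhere in the extractor.
soup = BeautifulSoup(html, 'html.parser')
for p in soup.find_all('p'):
    print(p.get_text(strip=True))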
Anyway, if you run a debugger and step through, you'll see that nodes_to_check contains all of your text, but for some reason calculate_best_node may return a subtree that does not contain all of the text blocks.
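If you want to see the candidate nodes without a full debugger session, something like this should print them (a sketch; note that parse() also runs a document cleaner before calculate_best_node, so the real input differs slightly):

from newspaper import Article
from newspaper.extractors import ContentExtractor

article = Article('https://www.nytimes.com/2017/02/23/us/politics/cpac-stephen-bannon-reince-priebus.html')
article.download()

parser = article.config.get_parser()
doc = parser.fromstring(article.html)

# Print the start of every candidate text block that nodes_to_check
# collects (the <p>, <pre>, and <td> elements the extractor considers).
extractor = ContentExtractor(article.config)
for node in extractor.nodes_to_check(doc):
    text = parser.getText(node)
    if text:
        print(text[:80])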
I just started using this library this morning, so I will dig a little deeper. Given the lack of comments around the heuristics used in calculate_best_node, I may create my own extractor that inherits from ContentExtractor and overrides calculate_best_node with something else.
Comment by kmgreen2 Fri Sep 18 16:56:00 2020
Here is a quick hack. Again, I just started using this package this morning, so I may be missing something. That said, I'll likely fork and restructure Article to allow a custom extractor (there's a sketch of that at the end of this comment), instead of abusing Python's ability to mutate class internals. This looks like it has been a problem for a long time, so I assume a real fix on master would take a while.
This hack skips the heuristic approach of carving the "text subtree" directly out of the DOM and instead builds a new tree of height 2, whose children are the filtered text nodes.
from newspaper.extractors import ContentExtractor
from newspaper import Article
from lxml import etree


class TextContextExtractor(ContentExtractor):
    def __init__(self, config):
        ContentExtractor.__init__(self, config)

    def calculate_best_node(self, doc):
        # Instead of scoring candidate subtrees, gather every candidate
        # text node and hang its text off a fresh, flat root element.
        nodes_to_check = self.nodes_to_check(doc)
        root = etree.Element("root")
        for node in nodes_to_check:
            text_node = self.parser.getText(node)
            word_stats = self.stopwords_class(
                language=self.language).get_stopword_count(text_node)
            high_link_density = self.is_highlink_density(node)
            # Keep blocks that contain real prose (more than two stopwords)
            # and are not mostly links (navigation, related-article lists).
            if word_stats.get_stopword_count() > 2 and not high_link_density:
                text_element = etree.SubElement(root, "foo")
                text_element.text = text_node
        return root


if __name__ == '__main__':
    article = Article('https://www.nytimes.com/2017/02/23/us/politics/cpac-stephen-bannon-reince-priebus.html')
    # Swap in the custom extractor before download()/parse() run.
    article.extractor = TextContextExtractor(article.config)
    article.download()
    article.parse()
    print(article.text)
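And for the fork idea mentioned above, the restructuring could be as small as this sketch, reusing the TextContextExtractor defined above (CustomArticle and the extractor_cls argument are made up here, not part of the package):

from newspaper import Article
from newspaper.extractors import ContentExtractor


class CustomArticle(Article):
    """Article that accepts a caller-supplied extractor class."""

    def __init__(self, url, extractor_cls=ContentExtractor, **kwargs):
        Article.__init__(self, url, **kwargs)
        # Article.__init__ builds a ContentExtractor; replace it with
        # the requested class, constructed from the same config.
        self.extractor = extractor_cls(self.config)


article = CustomArticle(
    'https://www.nytimes.com/2017/02/23/us/politics/cpac-stephen-bannon-reince-priebus.html',
    extractor_cls=TextContextExtractor)
article.download()
article.parse()
print(article.text)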
Hope this helps others :)
Partially solved. NYTimes is still not optimal
Issue by ariel-frischer Sun Feb 2 02:04:58 2020. Originally opened as https://github.com/codelucas/newspaper/issues/776
First off, I would like to thank the creators for making this package free; it is a lifesaver and a timesaver. However, I'd like to address the issues I'm having with the extractor and perhaps find a workaround. My conda env has:
newspaper3k=0.2.8=py37_0
The following is my sample article, which only extracts text beginning multiple paragraphs below where the article actually starts: NYTIMES Sample. My extracted text begins with:
But it should begin with:
I've noticed this is the case for multiple articles on the nytimes website. I've just updated my packages, and that did not help. I would appreciate it if anyone knows the source of these problems. I know that getting this package to extract every website perfectly may be unattainable, but if there is a way, I may look into fixing this myself. Below is my basic setup:
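(The setup snippet did not survive the re-post; it is essentially the standard newspaper3k flow, something like the sketch below, with the placeholder URL standing in for the sample article linked above.)

from newspaper import Article

url = 'https://www.nytimes.com/...'  # the sample NYT article linked above

article = Article(url)
article.download()   # fetch the HTML
article.parse()      # run the extractor
print(article.text)  # begins several paragraphs into the story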