codelucas / newspaper

newspaper3k is a news, full-text, and article metadata extraction library in Python 3. Advanced docs:
https://goo.gl/VX41yK
MIT License

Incorrect article text is extracted for multiple articles on some domains. #776

Open ariel-frischer opened 4 years ago

ariel-frischer commented 4 years ago

First off, I would like to thank the creators for making this package free; it is a lifesaver and a timesaver. However, I'd like to address the issues I'm having with the extractor and perhaps find a workaround. My conda env has: newspaper3k=0.2.8=py37_0
The following sample article only starts extracting text multiple paragraphs below where the article actually begins: NYTIMES Sample. My extracted text begins with:

"In letters to state regulatory boards and in interviews..."

But it should begin with:

"For Alyssa Watrous, the medication mix-up meant..."

I've noticed this is the case for multiple articles on the nytimes website. I've just updated my packages, and that did not help. I would appreciate it if anyone knows the source of these problems. I know getting this package to extract every website perfectly may be unattainable, but if there is a way, I may look into fixing this myself. Below is my basic setup:

from newspaper import Article, Config

config = Config()                      # default configuration
article = Article(url, config=config)  # url is the article URL string
article.download()
article.parse()
article.nlp()
ariel-frischer commented 4 years ago

I also seem to get it for Medium articles:

ashkaushik commented 4 years ago

@ArielFrischer any luck in fixing these issues? I'm stuck with the same issue. Please help, do you have a solution?

ariel-frischer commented 4 years ago

@ashkaushik I don't have the experience or the time to delve into how the node structure works for this package. I would honestly pay someone to fix these issues if they have some expertise on this library. I just wish this project were better maintained; no updates in a while...

kmgreen2 commented 3 years ago

@ariel-frischer @ashkaushik Not sure if this helps, but this appears to be an issue with logic in ContentExtractor. I figured this was an issue with either the downloader or the parser. I verified the downloaded HTML parses fine using a simple bs4 parser that extracts the paragraph tags. After some debugging, it looks like calculate_best_node tries to extract the most "meaningful" subtree of text blocks, and may filter some out.
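For reference, here is a minimal version of that bs4 sanity check. The URL is the one from my example below; the User-Agent header is an assumption on my part, since nytimes.com may reject requests without one:

import requests
from bs4 import BeautifulSoup

# If the "missing" paragraphs show up here, the downloaded HTML is fine and
# the gap is introduced later by ContentExtractor, not by the download step.
url = 'https://www.nytimes.com/2017/02/23/us/politics/cpac-stephen-bannon-reince-priebus.html'
html = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}).text
soup = BeautifulSoup(html, 'html.parser')
for p in soup.find_all('p'):
    print(p.get_text())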

Anyway, if you run a debugger and step through, you'll see that nodes_to_check will have all of your text, but for some reason calculate_best_node may return a subtree that does not contain all of the text blocks.
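You can see the same thing without stepping through in a debugger. Here is a rough sketch; note that article.doc, nodes_to_check, and parser.getText are 0.2.8 internals rather than public API, so this may break on other versions:

from newspaper import Article

article = Article('https://www.nytimes.com/2017/02/23/us/politics/cpac-stephen-bannon-reince-priebus.html')
article.download()
article.parse()

# Print the first line of every candidate text node the extractor considers
# (<p>, <pre>, <td>, ...). Compare against article.text to spot the blocks
# the heuristic dropped.
extractor = article.extractor
for node in extractor.nodes_to_check(article.doc):
    text = extractor.parser.getText(node)
    if text:
        print(text.splitlines()[0][:80])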

I just started using this library this morning, so I will dig a little deeper. Given the lack of comments around the heuristics used in calculate_best_node, I may create my own extractor that inherits from ContentExtractor and overrides calculate_best_node with something else.

kmgreen2 commented 3 years ago

Here is a quick hack. Again, I just started using this package this morning, so I may be missing something. That said, I'll likely fork and restructure Article to allow a custom extractor (a sketch of that follows the code below), instead of abusing Python's ability to mutate class internals. This looks like it has been a problem for a long time, so I assume it would take a long time to get a real fix onto master.

This hack ignores the heuristic approach of building the "text subtree" directly from the DOM and instead builds a new tree of height 2, whose children are the filtered text nodes.

from newspaper.extractors import ContentExtractor
from newspaper import Article
from lxml import etree

class TextContextExtractor(ContentExtractor):
    def __init__(self, config):
        ContentExtractor.__init__(self, config)

    def calculate_best_node(self, doc):
        # Gather every candidate text node instead of letting the stock
        # heuristic pick a single "best" subtree.
        nodes_to_check = self.nodes_to_check(doc)
        root = etree.Element("root")

        for node in nodes_to_check:
            text_node = self.parser.getText(node)
            # Reuse the library's own filters: keep nodes with more than
            # two stopwords and a low link density.
            word_stats = self.stopwords_class(language=self.language). \
                get_stopword_count(text_node)
            high_link_density = self.is_highlink_density(node)
            if word_stats.get_stopword_count() > 2 and not high_link_density:
                # Append the surviving text as a child of a synthetic root,
                # producing the height-2 tree described above.
                text_element = etree.SubElement(root, "foo")
                text_element.text = text_node
        return root

if __name__ == '__main__':
    article = Article('https://www.nytimes.com/2017/02/23/us/politics/cpac-stephen-bannon-reince-priebus.html')
    # Swap in the custom extractor before download/parse.
    article.extractor = TextContextExtractor(article.config)
    article.download()
    article.parse()
    print(article.text)
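As a rough sketch of the restructuring I mentioned above (this subclass is hypothetical, not part of the library; it reuses the imports and TextContextExtractor from the snippet above), swapping the extractor in at construction time avoids mutating the instance afterwards:

class TextArticle(Article):
    # Hypothetical wrapper: install the custom extractor in __init__ so
    # callers never have to reach into article.extractor themselves.
    def __init__(self, url, **kwargs):
        Article.__init__(self, url, **kwargs)
        self.extractor = TextContextExtractor(self.config)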

Hope this helps others :)