AndyTheFactory / newspaper4k

📰 Newspaper4k, a fork of the beloved Newspaper3k. Extraction of articles, titles, and metadata from news websites.
MIT License

Incorrect article text is extracted for multiple articles on some domains. #432

Open AndyTheFactory opened 1 year ago

AndyTheFactory commented 1 year ago

Issue by ariel-frischer, Sun Feb 2 02:04:58 2020. Originally opened as https://github.com/codelucas/newspaper/issues/776


First off, I would like to thank the creators for making this package free; it is a lifesaver and a timesaver. However, I'd like to address the issues I'm having with the extractor and perhaps find a workaround. My conda env has: newspaper3k=0.2.8=py37_0
The following sample article (NYTIMES Sample) only starts extracting text several paragraphs below where the article actually begins. My extracted text begins with:

"In letters to state regulatory boards and in interviews..."

But it should begin with:

"For Alyssa Watrous, the medication mix-up meant..."

I've noticed this is the case for multiple articles on the nytimes website. I've just updated my packages, and that did not help. If anyone knows the source of these problems, I'd appreciate the help. I know that making this package extract every website perfectly may be unattainable, but if there is a way, I may look into fixing this myself. Below is my basic setup:

from newspaper import Article, Config

config = Config()
article = Article(url, config=config)
article.download()  # fetch the raw HTML
article.parse()     # extract text, title, metadata
article.nlp()       # keywords and summary
AndyTheFactory commented 1 year ago

Comment by ariel-frischer Sun Feb 2 21:37:38 2020


I also seem to get this for Medium articles:

AndyTheFactory commented 1 year ago

Comment by ashkaushik Thu May 7 17:35:28 2020


@ArielFrischer any luck fixing these issues? I'm stuck with the same problem; please help if you have a solution.

AndyTheFactory commented 1 year ago

Comment by ariel-frischer Fri May 8 02:01:24 2020


@ashkaushik I don't have the experience or the time to delve into how the node structure works in this package. I would honestly pay someone with expertise on this library to fix these issues. I just wish this project were better maintained; there have been no updates in a while...

AndyTheFactory commented 1 year ago

Comment by kmgreen2 Fri Sep 18 16:15:01 2020


@ariel-frischer @ashkaushik Not sure if this helps, but this appears to be an issue with logic in ContentExtractor. I figured this was an issue with either the downloader or the parser. I verified that the downloaded HTML parses fine using a simple bs4 parser that extracts the paragraph tags. After some debugging, it looks like calculate_best_node tries to extract the most "meaningful" subtree of text blocks, and may filter some out.
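
A minimal version of that sanity check (a sketch; url is assumed to hold one of the affected NYTimes articles):

from bs4 import BeautifulSoup
from newspaper import Article

# Verify the missing paragraphs are present in the downloaded HTML,
# which rules out the downloader as the culprit.
article = Article(url)
article.download()

soup = BeautifulSoup(article.html, "html.parser")
paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]
print("\n".join(paragraphs[:5]))  # the missing lede paragraphs show up here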

Anyway, if you run a debugger and step through, you'll see that nodes_to_check contains all of your text, but for some reason calculate_best_node may return a subtree that does not contain all of the text blocks.
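
The same inspection works without a debugger by poking at the internals directly (a sketch against the newspaper3k 0.2.8 internals used in the hack below; url is again an affected article):

from newspaper import Article

article = Article(url)
article.download()
article.parse()

# All of the article's text shows up among the candidate nodes...
candidates = article.extractor.nodes_to_check(article.doc)
print(len(candidates), "candidate text nodes")

# ...but top_node, the subtree calculate_best_node picked during parse(),
# is missing the opening paragraphs.
print(article.extractor.parser.getText(article.top_node)[:300])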

I just started using this library this morning, so will dig a little deeper. Given the lack of comments around the heuristics used in calculate_best_node, I may create my own extractor that inherits ContentExtractor and overrides calculate_best_node with something else.

AndyTheFactory commented 1 year ago

Comment by kmgreen2 Fri Sep 18 16:56:00 2020


Here is a quick hack. Again, I just started using this package this morning, so I may be missing something. That said, I'll likely fork and restructure Article to allow a custom extractor, instead of abusing Python's ability to mutate class internals. This looks like it has been a problem for a long time, so I assume it would take a while to get a real fix on master.

This hack ignores the heuristic approach to building the "text subtree" directly from the DOM and just builds a new tree of height 2, where the children are filtered text nodes.

from newspaper.extractors import ContentExtractor
from newspaper import Article
from lxml import etree


class TextContextExtractor(ContentExtractor):
    """Bypasses the best-subtree heuristic: keeps every candidate text
    node that passes the stopword and link-density filters."""

    def __init__(self, config):
        ContentExtractor.__init__(self, config)

    def calculate_best_node(self, doc):
        nodes_to_check = self.nodes_to_check(doc)
        # Build a flat tree of height 2 whose children are the filtered
        # text nodes, instead of picking a single "best" DOM subtree.
        root = etree.Element("root")

        for node in nodes_to_check:
            text_node = self.parser.getText(node)
            word_stats = self.stopwords_class(language=self.language). \
                get_stopword_count(text_node)
            high_link_density = self.is_highlink_density(node)
            # Same filters the stock heuristic applies: enough stopwords
            # (i.e. real prose) and not link-heavy (i.e. not navigation).
            if word_stats.get_stopword_count() > 2 and not high_link_density:
                text_element = etree.SubElement(root, "foo")  # tag name is arbitrary
                text_element.text = text_node
        return root


if __name__ == '__main__':
    article = Article('https://www.nytimes.com/2017/02/23/us/politics/cpac-stephen-bannon-reince-priebus.html')
    # Swap in the custom extractor before download/parse.
    article.extractor = TextContextExtractor(article.config)
    article.download()
    article.parse()
    print(article.text)
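
One caveat: since this skips the subtree scoring entirely, any node on the page that passes the stopword and link-density filters gets included, so unrelated prose (teasers, captions, comment snippets) may leak into the output.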

Hope this helps others :)

AndyTheFactory commented 10 months ago

Partially solved. NYTimes extraction is still not optimal.