codelucas / newspaper

newspaper3k is a news, full-text, and article metadata extraction library in Python 3. Advanced docs:
https://goo.gl/VX41yK
MIT License

Does not work with nytimes? #645

Open yingyingww opened 6 years ago

yingyingww commented 6 years ago

Describe the bug Whenever I try to extract content from NYTimes articles, the results are random and incomplete. I tried NYTimes articles on the Newspaper Demo page as well, and I got "Error converting html to string."

To Reproduce Start with http://newspaper-demo.herokuapp.com and put in any NYTimes article, such as https://www.nytimes.com/2018/10/25/business/european-economy.html; you will see the error.
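
For reference, a minimal local reproduction (a sketch, assuming newspaper3k is installed) shows the same truncation outside the demo app:

    from newspaper import Article

    url = "https://www.nytimes.com/2018/10/25/business/european-economy.html"
    article = Article(url)
    article.download()
    article.parse()
    # On affected sites this prints only a fragment of the article body
    print(article.text)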

Expected behavior The full article text. Instead, the demo page shows "Error converting html to string."

If this were run in a program or terminal, random words would appear.

panditarevolution commented 6 years ago

I found the same issue with articles on Quanta Magazine and several other news sites. This was also via the demo app.

heisenburger commented 5 years ago

Me too for The New Yorker and some other sites

manoadamro commented 5 years ago

For many UK news outlets, you will only get the first paragraph.

dividor commented 5 years ago

It's a pretty fundamental problem affecting major news outlets. Any update, please, on whether this might be resolved?

dividor commented 5 years ago

Not sure if this is useful or not, but in order to programmatically identify partial extractions (which totally blow up model training), I am comparing article.text with what I get when I strip tags and extract text directly with Beautiful Soup ...

    import re
    from bs4 import BeautifulSoup

    # Assumes `article` is a downloaded and parsed newspaper Article
    # and `url` is its URL
    newspaper3k_count = len(article.text.split(" "))

    soup = BeautifulSoup(article.html, "html.parser")

    # Get rid of some obvious non-content tags
    for t in ['script', 'meta', 'head', 'svg', 'button', 'img', 'ul']:
        for s in soup(t):
            s.extract()

    # Remove anything that looks like navigation
    regex = re.compile('.*(menu|navigation|sidebar).*')
    for d in soup.find_all(class_=regex):
        d.extract()

    # Now loop through top-level tags and extract text.
    # recursive=False so we only get each piece of text once.
    elements = soup.find_all(recursive=False)
    text = "\n".join(el.get_text() for el in elements)
    text = re.sub('\n', ' ', text)
    text = re.sub(' +', ' ', text)
    soup_count = len(text.split(" "))

    print("\n" + url + "; 3k count: " + str(newspaper3k_count)
          + "; html count: " + str(soup_count))
    print("Count ratio=" + str(newspaper3k_count / soup_count))

It's fuzzy, but this at least gives me a way to identify cases where a partial extraction is very likely (when the newspaper3k word count is less than 50% of the word count from the cruder extraction above).

Perhaps we could offer something like this in newspaper3k - a confidence level. It's also a cruder way of extracting text, and no doubt better programmers than I can do a better job, but at least the above is agnostic to some of the issues we see for NYT.
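
As a sketch of what such a confidence level might look like as a helper (the name extraction_confidence is hypothetical, not part of newspaper3k, and the whole-page get_text() here is even cruder than the version above):

    import re
    from bs4 import BeautifulSoup

    def extraction_confidence(article):
        # Hypothetical helper: ratio of newspaper3k's word count to a
        # crude tag-stripped word count. Values well below 1.0 suggest
        # a partial extraction.
        stripped = BeautifulSoup(article.html, "html.parser").get_text()
        stripped = re.sub(r'\s+', ' ', stripped).strip()
        soup_count = max(len(stripped.split(" ")), 1)
        newspaper3k_count = len(article.text.split(" "))
        return newspaper3k_count / soup_count

    # e.g. flag anything under 0.5 for review / exclusion from training
    if extraction_confidence(article) < 0.5:
        print("Likely partial extraction: " + article.url)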

mmaybeno commented 4 years ago

If you inspect the HTML of the NYT and other sites where the parsing is not working as expected, they are doing something specific with the divs that separate out all the content. My guess is newspaper trips up on this and thinks the article is done without recursively going back and processing everything.

mmaybeno commented 4 years ago

Upon further inspection, it has something to do with the calculate_best_node function in the extractor. For a given article, there is a heuristic method to find the top node of the element tree, after which it parses the text and sets it to .text. If you inspect the top node of an article from NYT, it is shortened, but you can traverse up the tree a few parents and get the full article.

parser = article.config.get_parser()

# Walk two parents up from the chosen top node, then pull the text
# out of each of its children
upper_top_node = parser.getParent(article.top_node)
upper_top_node = parser.getParent(upper_top_node)
texts = [parser.getText(node) for node in list(upper_top_node)]

mmaybeno commented 4 years ago

I've dug into this more, as it really looks to be impacting how I'm obtaining articles. I've narrowed it down specifically to how it first gets nodes_to_check = self.nodes_to_check(doc) in calculate_best_node.

Since these target only ['p', 'pre', 'td'] tags, you may get all the text you want, but it hurts you further down in the method's implementation. To decide which node to choose, it also scores the parent and grandparent nodes, to catch any further text in the article that is useful. The problem is that if the HTML has deeply nested divs, this method will never see the rest. Instead you'll have multiple nodes with high scores (gravityScore), but only one can be selected.
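
For context, the scoring step in calculate_best_node looks roughly like this (a paraphrase, not a verbatim quote of extractors.py): the parent gets the full boost, the grandparent only half, and nothing higher up the tree is ever touched:

    # Paraphrased sketch of the existing logic, per text node:
    parent_node = self.parser.getParent(node)
    self.update_score(parent_node, upscore)
    if parent_node not in parent_nodes:
        parent_nodes.append(parent_node)

    parent_parent_node = self.parser.getParent(parent_node)
    if parent_parent_node is not None:
        # The grandparent only gets half the score; the
        # great-grandparent and above get nothing at all
        self.update_score(parent_parent_node, upscore / 2)
        if parent_parent_node not in parent_nodes:
            parent_nodes.append(parent_parent_node)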

The way I see around this is either to modify how far it climbs the parent chain to grab the rest of the article (maybe with some threshold on score improvement), or some other solution that chains multiple high-scoring nodes together; a rough sketch of the first idea follows.
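
Something like this on the extractor, assuming its existing get_score/getParent helpers (the method name and the 1.2 margin are made up):

    def climb_while_improving(self, best_node, min_gain=1.2):
        # Hypothetical sketch: keep moving to the parent for as long as
        # its gravityScore beats the current node's by some margin
        node = best_node
        while True:
            parent = self.parser.getParent(node)
            if parent is None:
                return node
            if self.get_score(parent) < self.get_score(node) * min_gain:
                return node
            node = parent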

Side note, but also concerning: @codelucas hasn't pulled in any PRs recently, so even a fix for this would potentially not get merged in anytime soon.

kmgreen2 commented 4 years ago

@mmaybeno I also ran into this and landed in the same place you did (see #776 ). Adding some details here to see if it helps you in any way.

It seems this is happening because the scores are not being percolated up the DOM. It looks like it only goes as far as parent-of-parent: https://github.com/codelucas/newspaper/blob/f622011177f6c2e95e48d6076561e21c016f08c3/newspaper/extractors.py#L823

I think replacing the parent-of-parent update with one that percolates scores all the way up the DOM could fix this issue. Something like this:

            # Percolate weights up through the tree. Not really sure
            # what the upscore is, but it seems that it should decay,
            # so decaying linearly.
            ancestor_node = self.parser.getParent(parent_node)
            j = 1.0
            while ancestor_node is not None:
                self.update_score(ancestor_node, upscore / j)
                if ancestor_node not in parent_nodes:
                    parent_nodes.append(ancestor_node)
                ancestor_node = self.parser.getParent(ancestor_node)
                j += 1

This could lead to the root node being selected, so the code could be updated to stop at the text nodes' lowest common ancestor.
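
A sketch of that lowest-common-ancestor computation over lxml elements (which is what newspaper's parser hands around; text_nodes is assumed to be the high-scoring elements):

    def lowest_common_ancestor(text_nodes):
        # Build the root-to-node ancestor path for each element, then
        # take the deepest element shared by every path
        paths = []
        for node in text_nodes:
            paths.append(list(node.iterancestors())[::-1] + [node])
        lca = None
        for level in zip(*paths):
            first = level[0]
            if all(el is first for el in level):
                lca = first
            else:
                break
        return lca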

I just started looking at newspaper today, so I'm not sure how any of this is supposed to work. Specifically, I'm not sure how the scoring is supposed to work, since most of the params are hardcoded.

I'll likely just fork to fix for myself, since this seems like it is not actively maintained.

mmaybeno commented 3 years ago

@kmgreen2 How did your changes work out? The idea makes sense, I'll have to check out your fork. Thanks!