codelucas / newspaper

newspaper3k is a news, full-text, and article metadata extraction library for Python 3. Advanced docs:
https://goo.gl/VX41yK
MIT License

Post cleanup letting through bad content in <p> tag with highlink density #629

Open youhyunkim opened 6 years ago

youhyunkim commented 6 years ago

Hello!

First, I wanted to say how great this tool is. It's been very useful for me.

I've run across an issue with content parsing for one site (247wallst). For some articles on 247wallst's website, the parser output includes text from paragraphs with high link density.

Here's an example. Article URL: https://247wallst.com/special-report/2018/09/19/americas-fastest-growing-and-shrinking-housing-markets/

article.text: A decade has passed since the U.S. housing market crash and the beginning of the 2008 financial crisis. The typical American home lost about a third of its value during the recession. While the median home price has since surpassed its pre-crisis levels and reached an all-time high of $269,000 in the second quarter of 2018, the recovery from the Great Recession has been largely uneven. Some housing markets have even been on the decline in the recent years. Like anything else, home prices are driven by supply and demand forces, which are highly correlated with the area’s economy, job market, and population changes. Demand tends to be higher in stronger economies, with a healthy job market, and a growing population. While the price of a typical single-family home rose by more the $100,000 in some of the more high-demand metropolitan areas from the second quarter of 2017 to the second quarter of 2018, in a handful of cities, the median home value declined over the past year. Based on median single-family home price changes over the year through the second quarter from the National Association of Realtors, 24/7 Wall St. reviewed the fastest growing (and shrinking) housing markets. Click here to see the full list of America’s fastest growing housing markets. Click here to see the full list of America’s fastest shrinking housing markets. Click here to see our detailed findings and methodology.

Note that the last three sentences are link text and should be removed from the content body: Click here to see the full list of America’s fastest growing housing markets. Click here to see the full list of America’s fastest shrinking housing markets. Click here to see our detailed findings and methodology.

In the HTML of the original article, those links take the structure:

<p>
  <a>
    <span>
      <strong>Link Text</strong>
    </span>
  </a>
</p>

These links are part of the top_node, which is fine because they sit inside the article body tag. However, I expected the post_cleanup step to remove them. When I looked at the code, it seems that <p> tags are never removed.

See code reference: https://github.com/codelucas/newspaper/blob/master/newspaper/extractors.py#L1043-L1045
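
To make the behavior concrete, here is a minimal standalone sketch of a cleanup pass that exempts <p> elements. It is not newspaper's actual code: link_density_score and cleanup_skipping_paragraphs are hypothetical names, the scoring formula is a simplified stand-in for the library's heuristic, and the sample HTML is made up to mirror the structure above.

```python
import xml.etree.ElementTree as ET


def link_density_score(el):
    """Simplified stand-in for a link-density heuristic (an assumption,
    not newspaper's exact formula): the share of words that sit inside
    <a> tags, scaled by the number of links."""
    links = list(el.iter("a"))
    if not links:
        return 0.0
    words = "".join(el.itertext()).split()
    if not words:
        return float("inf")
    link_words = sum(len("".join(a.itertext()).split()) for a in links)
    return (link_words / len(words)) * len(links)


def cleanup_skipping_paragraphs(top_node, threshold=1.0):
    """Remove link-heavy children, but never touch <p> elements."""
    for child in list(top_node):
        if child.tag != "p" and link_density_score(child) >= threshold:
            top_node.remove(child)
    return top_node


# Made-up fragment: one prose paragraph, one link-only paragraph,
# and one link-only <div>.
html = (
    "<div>"
    "<p>A decade has passed since the U.S. housing market crash.</p>"
    "<p><a><span><strong>Click here to see the full list.</strong></span></a></p>"
    "<div><a>Related story</a></div>"
    "</div>"
)

top = cleanup_skipping_paragraphs(ET.fromstring(html))
remaining = ["".join(child.itertext()).strip() for child in top]
print(remaining)
```

In this sketch the link-only <div> is dropped, while the link-only <p> survives because of the tag exemption, which matches the output I'm seeing.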

Would the is_highlink_density check be enough on its own? In other words, would it be possible to drop the <p> tag check, given that there's also an is_highlink_density check? This would resolve all cases where links take the form:

<p><a>non-content text</a></p>
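
As a rough illustration of the proposed change, the sketch below applies a link-density test to every child, including <p>. The names is_link_heavy and cleanup_all_tags are hypothetical, and the scoring is a simplified stand-in, so this only shows the idea, not a patch against the library.

```python
import xml.etree.ElementTree as ET


def is_link_heavy(el, threshold=1.0):
    """Simplified stand-in for a link-density check (an assumption,
    not newspaper's exact logic)."""
    links = list(el.iter("a"))
    if not links:
        return False
    words = "".join(el.itertext()).split()
    if not words:
        return True
    link_words = sum(len("".join(a.itertext()).split()) for a in links)
    return (link_words / len(words)) * len(links) >= threshold


def cleanup_all_tags(top_node):
    """Remove link-heavy children regardless of tag, <p> included."""
    for child in list(top_node):
        if is_link_heavy(child):
            top_node.remove(child)
    return top_node


# Made-up fragment: a prose paragraph followed by a link-only paragraph.
html = (
    "<div>"
    "<p>Home prices are driven by supply and demand.</p>"
    "<p><a>non-content text</a></p>"
    "</div>"
)

top = cleanup_all_tags(ET.fromstring(html))
kept = ["".join(child.itertext()).strip() for child in top]
print(kept)
```

With the tag exemption gone, only the prose paragraph remains in this toy input.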

I'll gladly provide more information if you need it. Any help or explanation would be appreciated! Thanks!

codelucas commented 6 years ago

Thanks for the awesome issue and writeup @youhyunkim! 💯

The point you bring up is valid, but newspaper is purely heuristic-based and has no machine learning, so we can't get too sophisticated with the rules: every heuristic change will be good for some news websites and bad for others.

The most recent change in this space was done here: https://github.com/codelucas/newspaper/commit/7697eb021370892334c1da1aa7ae0879a5f5f1f5

That commit made the cleaning step more lenient. If you can prove with 100 sample articles that this cleanup step can be avoided without damaging the full-text extraction, I'd be in favor of making the change. We may need something more sophisticated than removing the <p> check, though, since sometimes an article actually does have high link density, but by putting it into a <p> the publisher is telling us it is part of the article.
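
To illustrate the trade-off, here is a small hedged sketch contrasting a simplified density score with one possible stricter rule: only treat a <p> as removable when all of its words sit inside links. Both density_score and is_link_only are hypothetical helpers written for this example, not part of newspaper, and the two sample paragraphs are made up.

```python
import xml.etree.ElementTree as ET


def words_in(el):
    """Count whitespace-separated words in an element's full text."""
    return len("".join(el.itertext()).split())


def density_score(el):
    """Simplified stand-in for a link-density score (an assumption)."""
    links = list(el.iter("a"))
    total = words_in(el)
    if not links or total == 0:
        return 0.0
    return sum(words_in(a) for a in links) / total * len(links)


def is_link_only(el):
    """Hypothetical stricter rule: every word of the element is link text."""
    total = words_in(el)
    linked = sum(words_in(a) for a in el.iter("a"))
    return total > 0 and linked >= total


# A real sentence that happens to cite several sources inline.
cited = ET.fromstring(
    "<p>See the <a>Realtors report</a>, the <a>Census data</a>, "
    "and <a>our methodology</a>.</p>"
)
# A teaser paragraph whose entire text is a single link.
teaser = ET.fromstring(
    "<p><a><span><strong>Click here to see our detailed findings "
    "and methodology.</strong></span></a></p>"
)

for name, p in [("cited", cited), ("teaser", teaser)]:
    print(name, round(density_score(p), 2), is_link_only(p))
```

In this toy case the density score alone would flag both paragraphs, while the link-only rule flags just the teaser. Whether a rule like this holds up across real sites would still need checking against a sample of articles.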