codelucas / newspaper

newspaper3k is a news, full-text, and article metadata extraction library for Python 3. Advanced docs:
https://goo.gl/VX41yK
MIT License
14.14k stars 2.12k forks

Incorrect article extraction #128

Open attodorov opened 9 years ago

attodorov commented 9 years ago

Hey,

I really like newspaper - nice job! It has one major flaw, though: on a lot of sites, like wsj.com, reuters.com, etc., it doesn't extract the article text precisely, but also includes left/right menus such as "Top most read articles" and similar things. If you use a library like the one Instapaper uses, you can extract the actual article text without any extra clutter. Here is one example (written in Java) that works pretty well:

https://code.google.com/p/boilerpipe/

Thanks --Angel

codelucas commented 9 years ago

Give a few example URLs where newspaper fails and we'll debug it.

Also, the algorithm newspaper uses is based on Gravity Lab's goose extractor. In a comparison of extraction performance, goose's algorithm did as well as boilerpipe. For the examples you are looking at, newspaper fails, but for others it's the other way around.

TwistingTwists commented 8 years ago

I have compiled a small list of links where it does not work "properly".

http://www.nytimes.com/2015/11/30/us/politics/illinois-campaign-money-bruce-rauner.html --images between articles are not pulled. Sometimes, they are used to reference some graphs. (Possibly, use .md format for embedding images in text with images stored in a directory?)

https://medium.com/the-coffeelicious/chronicling-depression-with-photography-cedbeffa79ee#.4jioqqaqo -- incomplete article extraction.

https://medium.com/the-macroscope/sometimes-a-whale-dies-689720e3e456#.53lrtza7z -- incomplete article extraction

http://www.insightsonindia.com/2015/11/02/insights-into-editorial-for-a-truer-decentralization/ -- incomplete article extraction (the last paragraph of the article is left out). I tried the link above with the Clearly extension in Chrome, and Clearly extracts the complete article. @codelucas, @attodorov: could you have a look at their code, if possible?

I'll add more as I find them.

wayspurrchen commented 8 years ago

Wanted to +1 here with some issues on the Fortune site.

http://fortune.com/2015/09/09/the-siege-of-herbalife/

http://fortune.com/amazon-jeff-bezos-prime/

zhannar commented 8 years ago

Also +1 on here:

url = "http://www.nytimes.com/2010/10/07/us/politics/07manchin.html"

http://stackoverflow.com/questions/37392098/python-newspaper-library-why-is-it-missing-sizable-portions-of-articles

cmermingas commented 7 years ago

The extraction skips the first few paragraphs for this URL:

https://www.nytimes.com/2017/02/08/travel/budget-travel-frugal-traveler-top-tips-2017.html

agordon commented 7 years ago

Hello,

The first paragraph is missed in some Engadget articles (e.g. https://www.engadget.com/2017/05/30/intel-core-i9-extreme/), while extraction works in others (e.g. https://www.engadget.com/2017/05/29/whats-on-tv-house-of-cards-fear-the-walking-dead/).

From a very cursory look, it seems that for articles where the first paragraph(s) are missed, the website separates the first paragraphs from the others, perhaps for a "Read Full Article" kind of button (hinted at by the js-notMobileReferredByFbTw class in the example below).

This is the general structure of the "intel i9" article (link above):

<div class="o-article_block pb-15 pb-5@m- o-subtle_divider mt-n10@m-">
  <div class="grid@tl+">
    <div class="grid@tl+__cell col-8-of-12@tl+">
      <div class="article-text c-gray-1">
        <p> [[ CONTENT OF FIRST PARAGRAPH ]] </p>
      </div>
    </div>
  </div>
</div>

<div class="js-notMobileReferredByFbTw">
  <div class="o-article_block pb-15 pb-5@m- mt-n35 mt-n25@m mt-n15@s">
    <div class="grid@tl+">
      <div class="full-width@tp- grid@tl+__cell col-8-of-12@tl+">
        <div class="article-text c-gray-1 no-review">
          <p> [[ CONTENT OF ALL OTHER PARAGRAPHS ]] </p>
          <p>  </p>
        </div>
      </div>
    </div>
  </div>
</div>
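
A minimal sketch of one possible workaround for a page split like this: gather the paragraphs from every container carrying the article-text class, instead of trusting a single top-scoring node. This is not how newspaper works internally. The names ArticleTextCollector and extract_article_paragraphs are hypothetical, the class name is taken from the Engadget markup above, and the code assumes reasonably well-formed HTML (every <p> and <div> explicitly closed).

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not affect depth tracking.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ArticleTextCollector(HTMLParser):
    """Collect <p> text from every element whose class list contains target_class."""

    def __init__(self, target_class="article-text"):
        super().__init__()
        self.target_class = target_class
        self.depth = 0             # nesting depth inside a matching container (0 = outside)
        self.in_paragraph = False
        self.buffer = []           # text fragments of the paragraph being read
        self.paragraphs = []       # finished, non-empty paragraphs in document order

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
            if tag == "p":
                self.in_paragraph = True
                self.buffer = []
        elif self.target_class in (dict(attrs).get("class") or "").split():
            self.depth = 1         # entered a matching container

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        if tag == "p" and self.in_paragraph:
            # Collapse runs of whitespace and drop empty paragraphs.
            text = " ".join("".join(self.buffer).split())
            if text:
                self.paragraphs.append(text)
            self.in_paragraph = False
        self.depth -= 1

    def handle_data(self, data):
        if self.in_paragraph:
            self.buffer.append(data)


def extract_article_paragraphs(html, target_class="article-text"):
    """Return the text of all paragraphs found inside matching containers."""
    collector = ArticleTextCollector(target_class)
    collector.feed(html)
    collector.close()
    return collector.paragraphs
```

Joining the returned list with blank lines gives a text that includes both the leading paragraph and the rest of the article. A site-specific rule like this is brittle, so it is probably best kept as a fallback for known problem sites.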

(BTW, thanks for a great package!)

ekingery commented 6 years ago

I am in a similar situation: I've found various minor issues with the way article content is parsed and extracted. Even if a bunch of relatively minor tweaks would be helpful (for example, using the <article> tag for this article, which currently fails to parse properly), it's not clear to me that this library is actively maintained and would adopt them. Another useful change would be to make the library's configuration more flexible. For example, being able to modify the bad_chunks variable would help resolve the one-off issues people tend to encounter with specific sites.

The approach I've taken is to use both newspaper and goose3. My code then compares the results in an attempt to choose the result that looks most correct. We default to newspaper, and I've seen plenty of cases where newspaper is correct and goose3 is not. YMMV. Hope that helps!
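
A rough sketch of the two-extractor approach described above. The comment does not say how the results are compared, so the rule used here (keep the default extractor's text unless the other one returns substantially more) is purely an assumption, as are the name choose_extraction and the min_gain threshold. The extractors are passed in as plain callables so the selection logic does not depend on either library.

```python
from typing import Callable


def choose_extraction(url: str,
                      primary: Callable[[str], str],
                      fallback: Callable[[str], str],
                      min_gain: float = 1.5) -> str:
    """Return the primary text unless the fallback is more than min_gain times longer.

    The length-based rule is an illustrative assumption, not a documented
    heuristic of newspaper or goose3.
    """
    def safe(extract: Callable[[str], str]) -> str:
        # A failing extractor should not abort the comparison.
        try:
            return (extract(url) or "").strip()
        except Exception:
            return ""

    primary_text = safe(primary)
    fallback_text = safe(fallback)
    if len(fallback_text) > min_gain * len(primary_text):
        return fallback_text
    return primary_text
```

With the real libraries, primary could wrap newspaper's Article(url) followed by download(), parse() and reading .text, and fallback could wrap goose3's Goose().extract(url=url).cleaned_text. Length alone will also reward an extractor that pulls in sidebar clutter, so a real comparison would likely need further checks.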