AndyTheFactory / newspaper4k

📰 Newspaper4k a fork of the beloved Newspaper3k. Extraction of articles, titles, and metadata from news websites.
MIT License

Incorrect article extraction #12

Open AndyTheFactory opened 1 year ago

AndyTheFactory commented 1 year ago

Issue by attodorov Tue Mar 10 13:10:38 2015 Originally opened as https://github.com/codelucas/newspaper/issues/128


Hey,

I really like newspaper - nice job! It has one major flaw, though: on a lot of sites, such as wsj.com and reuters.com, it doesn't extract the article text precisely, and also includes left/right menus such as "Top most read articles" and similar things. If you use a library like the one Instapaper uses, you can extract the actual article text without any extra clutter. Here is one example (written in Java) that works pretty well:

https://code.google.com/p/boilerpipe/

Thanks --Angel

AndyTheFactory commented 1 year ago

Comment by codelucas Sat Mar 14 22:07:32 2015


Give a few example URLs where newspaper fails and we'll debug them.

Also, the algorithm newspaper uses is based on Gravity Labs' goose extractor. There was a comparison of extraction performance, and goose's algorithm did as well as boilerpipe. For the examples you are looking at, newspaper fails, but for others it's the other way around.

AndyTheFactory commented 1 year ago

Comment by TwistingTwists Mon Nov 30 18:44:07 2015


I have compiled a small list of links where it does not work "properly".

http://www.nytimes.com/2015/11/30/us/politics/illinois-campaign-money-bruce-rauner.html -- images between articles are not pulled. Sometimes they are used to reference graphs. (Possibly use .md format to embed images in the text, with the images stored in a directory?)

https://medium.com/the-coffeelicious/chronicling-depression-with-photography-cedbeffa79ee#.4jioqqaqo -- incomplete article extraction.

https://medium.com/the-macroscope/sometimes-a-whale-dies-689720e3e456#.53lrtza7z -- incomplete article extraction

http://www.insightsonindia.com/2015/11/02/insights-into-editorial-for-a-truer-decentralization/ -- incomplete article extraction [the last paragraph at the end of the article is left out]. I tried the above link with the Clearly extension in Chrome, and Clearly does extract the complete article. @codelucas, @attodorov: could you have a look at their code, if possible?

I'll add more as I find them.

AndyTheFactory commented 1 year ago

Comment by wayspurrchen Fri Mar 25 02:30:10 2016


Wanted to +1 here with some issues on the Fortune site.

http://fortune.com/2015/09/09/the-siege-of-herbalife/

http://fortune.com/amazon-jeff-bezos-prime/

AndyTheFactory commented 1 year ago

Comment by zhannar Mon May 23 13:58:44 2016


Also +1 on here:

url = "http://www.nytimes.com/2010/10/07/us/politics/07manchin.html"

http://stackoverflow.com/questions/37392098/python-newspaper-library-why-is-it-missing-sizable-portions-of-articles

AndyTheFactory commented 1 year ago

Comment by cmermingas Thu Feb 9 02:12:47 2017


The extraction skips the first few paragraphs for this URL:

https://www.nytimes.com/2017/02/08/travel/budget-travel-frugal-traveler-top-tips-2017.html

AndyTheFactory commented 1 year ago

Comment by agordon Tue May 30 19:41:56 2017


Hello,

The first paragraph is missed in some Engadget articles (e.g. https://www.engadget.com/2017/05/30/intel-core-i9-extreme/), while extraction works in others (e.g. https://www.engadget.com/2017/05/29/whats-on-tv-house-of-cards-fear-the-walking-dead/).

From a very cursory look, it seems that in articles where the first paragraph(s) are missed, the website separates the first paragraphs from the others, perhaps for a "Read Full Article" kind of button (hinted at by the js-notMobileReferredByFbTw class in the example below).

This is the general structure of the "intel i9" article (link above):

<div class="o-article_block pb-15 pb-5@m- o-subtle_divider mt-n10@m-">
  <div class="grid@tl+">
    <div class="grid@tl+__cell col-8-of-12@tl+">
      <div class="article-text c-gray-1">
        <p> [[ CONTENT OF FIRST PARAGRAPH ]] </p>
      </div>
    </div>
  </div>
</div>

<div class="js-notMobileReferredByFbTw">
  <div class="o-article_block pb-15 pb-5@m- mt-n35 mt-n25@m mt-n15@s">
    <div class="grid@tl+">
      <div class="full-width@tp- grid@tl+__cell col-8-of-12@tl+">
        <div class="article-text c-gray-1 no-review">
          <p> [[ CONTENT OF ALL OTHER PARAGRAPHS ]] </p>
          <p> </p>
        </div>
      </div>
    </div>
  </div>
</div>

(BTW, thanks for a great package!)
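One possible workaround for a split layout like the one above is to gather the paragraphs from every article-text container instead of picking a single "best" node. The sketch below is not how newspaper works internally; it is a standalone illustration using only Python's standard html.parser, and the class name ArticleTextCollector and the function collect_article_paragraphs are hypothetical names introduced here. It assumes the site keeps marking body text with the article-text class shown in the structure above.

```python
from html.parser import HTMLParser


class ArticleTextCollector(HTMLParser):
    """Collect <p> text from every <div> whose class list contains 'article-text'.

    Hypothetical helper for illustration; not part of newspaper.
    """

    def __init__(self):
        super().__init__()
        self.div_depth = 0        # nesting depth inside an article-text div (0 = outside)
        self.in_paragraph = False
        self.buffer = []
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            classes = (dict(attrs).get("class") or "").split()
            if self.div_depth > 0:
                self.div_depth += 1          # nested div inside the container
            elif "article-text" in classes:
                self.div_depth = 1           # entering an article-text container
        elif tag == "p" and self.div_depth > 0:
            self.in_paragraph = True
            self.buffer = []

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth > 0:
            self.div_depth -= 1
        elif tag == "p" and self.in_paragraph:
            # Normalise whitespace and skip empty paragraphs such as "<p>  </p>".
            text = " ".join("".join(self.buffer).split())
            if text:
                self.paragraphs.append(text)
            self.in_paragraph = False

    def handle_data(self, data):
        if self.in_paragraph:
            self.buffer.append(data)


def collect_article_paragraphs(html):
    """Return the non-empty paragraphs found in all article-text divs, in order."""
    parser = ArticleTextCollector()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


sample = """
<div class="o-article_block">
  <div class="article-text c-gray-1"><p>First paragraph.</p></div>
</div>
<div class="js-notMobileReferredByFbTw">
  <div class="article-text c-gray-1 no-review"><p>Second paragraph.</p><p>  </p></div>
</div>
<div class="sidebar"><p>Most read</p></div>
"""
print(collect_article_paragraphs(sample))
```

With the simplified sample markup, both body paragraphs are collected and the sidebar text is ignored. A site-specific rule like this is brittle, so it is probably best treated as a fallback when the generic extraction returns a suspiciously short text.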

AndyTheFactory commented 1 year ago

Comment by ekingery Mon Oct 15 21:49:27 2018


I am in a similar situation, in that I've found various minor issues with the way article content is parsed and extracted. Even though a bunch of relatively minor tweaks would be helpful (for example, using the <article> tag for this article which currently fails to parse properly), it's not clear to me that this library is actively maintained and would adopt them. Another thing that would probably be useful is making the library more flexible to configure. For example, being able to modify the bad_chunks variable would help resolve the one-off issues people tend to encounter with specific sites.

The approach I've taken is to use both newspaper and goose3. My code then compares the results in an attempt to choose the result that looks most correct. We default to newspaper, and I've seen plenty of cases where newspaper is correct and goose3 is not. YMMV. Hope that helps!
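A rough sketch of that two-extractor approach might look like the following. The comment above does not say how the results are compared, so the rule used here (keep newspaper's text unless it is empty or much shorter than goose3's) is purely an assumption, as are the names choose_text, extract_with_both, and the min_ratio threshold. The library calls used are newspaper's Article download/parse/text and goose3's Goose().extract(url=...).cleaned_text.

```python
def choose_text(newspaper_text, goose_text, min_ratio=0.5):
    """Pick one of two extraction results.

    Assumed heuristic (not from the original comment): default to newspaper,
    and fall back to goose3 only when newspaper's text is empty or shorter
    than min_ratio times the length of goose3's text.
    """
    primary = (newspaper_text or "").strip()
    fallback = (goose_text or "").strip()
    if not primary:
        return fallback
    if len(primary) < min_ratio * len(fallback):
        return fallback
    return primary


def extract_with_both(url):
    """Run newspaper and goose3 on the same URL and return the chosen text."""
    # Imported lazily so choose_text can be used without either package installed.
    from newspaper import Article
    from goose3 import Goose

    article = Article(url)
    article.download()
    article.parse()

    goose_article = Goose().extract(url=url)
    return choose_text(article.text, goose_article.cleaned_text)
```

Length alone is a crude signal: a longer result may simply contain more sidebar clutter, so a real comparison would likely need extra checks, for example penalising text that repeats navigation phrases.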