codelucas / newspaper

newspaper3k is a news, full-text, and article metadata extraction in Python 3. Advanced docs:
https://goo.gl/VX41yK
MIT License

Not extracting full text on what appear to be simple New York Times pages #698

Open dividor opened 5 years ago

dividor commented 5 years ago

Love this package, amazingly useful.

I am seeing a few sites that don't parse the full text, for example ...

from newspaper import Article
url = "https://www.nytimes.com/2019/03/06/technology/personaltech/key-duplicating-machine.html"
article = Article(url)
article.download()
article.parse()
print(article.text)  # the extracted text ends partway through the article

There is no paywall or anything like that as far as I can tell; the text just stops at "In the event of a crime, the police could check whether a key was duplicated with KeyMe and track down who had copied it.", which is halfway through the article.

Another NYT example:

https://www.nytimes.com/2019/01/23/travel/pittsburgh-horror-filmmaker-george-romero.html

In both cases the paragraph where the text gets cut off starts with a quote; I am not sure whether that matters.
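One way to check where the extraction stops is to compare the text of the page's <p> elements against the extracted text. The following is only a rough diagnostic sketch using the standard library. `ParagraphCollector`, `missing_paragraphs` and the `min_chars` threshold are hypothetical names made up for illustration and are not part of newspaper. It assumes the article body sits in plain <p> tags in the downloaded HTML, which may not hold for every page.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the whitespace-normalized text of every <p> element."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_p = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()  # an unclosed <p> ends when the next one starts
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._chunks.append(data)

    def close(self):
        super().close()
        self._flush()  # keep a trailing paragraph that was never closed

    def _flush(self):
        text = " ".join("".join(self._chunks).split())
        if text:
            self.paragraphs.append(text)
        self._chunks = []


def missing_paragraphs(html, extracted_text, min_chars=40):
    """Return <p> texts from `html` that do not occur in `extracted_text`.

    Paragraphs shorter than `min_chars` are skipped to ignore bylines,
    captions and similar short fragments (an arbitrary cutoff).
    """
    collector = ParagraphCollector()
    collector.feed(html)
    collector.close()
    haystack = " ".join(extracted_text.split())
    return [
        p for p in collector.paragraphs
        if len(p) >= min_chars and p not in haystack
    ]
```

With the snippet above, calling `missing_paragraphs(article.html, article.text)` after `article.parse()` should list the paragraphs that were left out, which would show whether they all begin with a quote. Whitespace around inline links may differ between the two texts, so a few false positives are possible.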

I am on 0.2.8 of newspaper3k (and lxml==4.3.0).

Am I doing something wrong perhaps?

markschaver commented 5 years ago

Longstanding issue. See also https://github.com/codelucas/newspaper/issues/645