codelucas / newspaper

newspaper3k is a news, full-text, and article metadata extraction in Python 3. Advanced docs:
https://goo.gl/VX41yK
MIT License

I can see the article but cannot download it via newspaper3k #829

Open monajalal opened 3 years ago

monajalal commented 3 years ago

I can see the article at http://www.chicagotribune.com/ct-florida-school-shooter-nikolas-cruz-20180217-story.html when browsing in Firefox. However, newspaper3k gives me this error:

Article `download()` failed with HTTPSConnectionPool(host='www.chicagotribune.com', port=443): Read timed out. (read timeout=7) on URL http://www.chicagotribune.com/ct-florida-school-shooter-nikolas-cruz-20180217-story.html

My code is:

from newspaper import Article
from newspaper import Config

user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'
config = Config()

config.browser_user_agent = user_agent

url = "https://www.chicagotribune.com/nation-world/ct-florida-school-shooter-nikolas-cruz-20180217-story.html"

page = Article(url, config=config)

page.download()
page.parse()
print(page.text)

johnbumgarner commented 3 years ago

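Your code works fine for me, so the 'Read timed out' error was most likely a transient problem at the moment you ran it. newspaper3k supports a request_timeout setting in Config(). Its default is 7 seconds, which matches the "read timeout=7" in your error, and raising it can help prevent future 'Read timed out' issues.

Reference: the timeout section of the requests documentation.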

from newspaper import Article
from newspaper import Config

user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'
config = Config()

config.browser_user_agent = user_agent
config.request_timeout = 10  # seconds; the newspaper3k default is 7

url = "https://www.chicagotribune.com/nation-world/ct-florida-school-shooter-nikolas-cruz-20180217-story.html"

page = Article(url, config=config)

page.download()
page.parse()
print(page.text)
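
If the timeout still occurs now and then, one option is to retry with a progressively longer request_timeout. The following is only a sketch under stated assumptions: fetch_with_retries and fetch_article_text are hypothetical helper names (not part of newspaper3k), and the timeouts and pause are arbitrary values. It relies on the failure surfacing as an exception from parse(), as in the error reported above.

```python
import time


def fetch_with_retries(fetch, timeouts=(10, 20, 40), pause=1.0, sleep=time.sleep):
    """Call fetch(timeout) with each timeout in turn until one succeeds.

    fetch is any callable that raises an exception on failure (hypothetical
    helper, not part of newspaper3k). The last exception is re-raised if
    every attempt fails.
    """
    if not timeouts:
        raise ValueError("timeouts must contain at least one value")
    last_error = None
    for attempt, timeout in enumerate(timeouts, start=1):
        try:
            return fetch(timeout)
        except Exception as error:  # e.g. newspaper's ArticleException
            last_error = error
            if attempt < len(timeouts):
                sleep(pause)  # brief pause before the next, longer attempt
    raise last_error


def fetch_article_text(url, user_agent, timeout):
    """Download and parse one article with the given request timeout."""
    # Imported lazily so the retry helper above has no third-party dependency.
    from newspaper import Article, Config

    config = Config()
    config.browser_user_agent = user_agent
    config.request_timeout = timeout  # seconds

    page = Article(url, config=config)
    page.download()
    page.parse()  # raises ArticleException if the download failed
    return page.text


if __name__ == "__main__":
    user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'
    url = "https://www.chicagotribune.com/nation-world/ct-florida-school-shooter-nikolas-cruz-20180217-story.html"
    print(fetch_with_retries(lambda t: fetch_article_text(url, user_agent, t)))
```

Retrying only helps with transient failures. If every attempt times out, the site may be blocking or throttling the request, and a longer timeout will not fix that.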