codelucas / newspaper

newspaper3k is a news, full-text, and article metadata extraction library for Python 3. Advanced docs:
https://goo.gl/VX41yK
MIT License
13.89k stars 2.1k forks

gnews with user agent returns empty text #976

Open wj210 opened 8 months ago

wj210 commented 8 months ago

I ran into some issues when scraping with gnews. The errors are along the lines of:

Article download() failed with 403 Client Error: Max restarts limit reached for url
Article download() failed with 403 Client Error: Forbidden for url

So I followed https://github.com/johnbumgarner/newspaper3_usage_overview and set the user-agent headers, but as soon as I do that, article.text returns an empty string.

The links are Google News RSS article links, for example: "https://news.google.com/rss/articles/CBMifWh0dHBzOi8vc2Vla2luZ2FscGhhLmNvbS9hcnRpY2xlLzE4NDM5MzItdGhlLWV4cGxhbmF0aW9uLWJlaGluZC1hcHBsZXMtZ3Jvc3MtbWFyZ2luLWRlY2xpbmUtYW5kLXdoeS10aGUtZnV0dXJlLWxvb2tzLWJyaWdodGVy0gEA?oc=5&hl=en-SG&gl=SG&ceid=SG:en"

The underlying link, "https://seekingalpha.com/article/1843932-the-explanation-behind-apples-gross-margin-decline-and-why-the-future-looks-brighter", works fine when passed to newspaper directly.
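
Since the underlying link works, one possible workaround is to recover it from the Google News link before handing it to newspaper. The sketch below assumes that the last path segment of the link is URL-safe base64 with the publisher URL embedded in the decoded bytes, which holds for the example above but is an observation about this link, not a documented Google format. Links that do not embed the URL this way simply yield None. The function name extract_embedded_url is hypothetical and is not part of gnews or newspaper.

```python
import base64
import re
from typing import Optional
from urllib.parse import urlparse


def extract_embedded_url(google_news_url: str) -> Optional[str]:
    """Return the publisher URL embedded in a Google News RSS article link.

    Assumption: the last path segment is URL-safe base64 and the decoded
    bytes contain the publisher URL as plain ASCII. Returns None when no
    URL can be found in the decoded payload.
    """
    # The encoded token is the last path segment; the query string is ignored.
    token = urlparse(google_news_url).path.rsplit("/", 1)[-1]
    # Restore any '=' padding that was stripped from the token.
    padded = token + "=" * (-len(token) % 4)
    try:
        payload = base64.urlsafe_b64decode(padded)
    except ValueError:  # binascii.Error is a subclass of ValueError
        return None
    # Take the first run of printable ASCII that starts with http:// or https://.
    match = re.search(rb"https?://[\x21-\x7e]+", payload)
    return match.group(0).decode("ascii") if match else None


if __name__ == "__main__":
    rss_link = (
        "https://news.google.com/rss/articles/"
        "CBMifWh0dHBzOi8vc2Vla2luZ2FscGhhLmNvbS9hcnRpY2xlLzE4NDM5MzItdGhl"
        "LWV4cGxhbmF0aW9uLWJlaGluZC1hcHBsZXMtZ3Jvc3MtbWFyZ2luLWRlY2xpbmUt"
        "YW5kLXdoeS10aGUtZnV0dXJlLWxvb2tzLWJyaWdodGVy0gEA"
        "?oc=5&hl=en-SG&gl=SG&ceid=SG:en"
    )
    print(extract_embedded_url(rss_link))
```

The returned URL can then be passed to newspaper in place of the news.google.com link. When the function returns None, the link has to be resolved some other way, for example by following the redirect in a browser.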

johnbumgarner commented 8 months ago

Thanks for mentioning my usage document in this issue. Which sites give you a 403?