AndyTheFactory / newspaper4k

📰 Newspaper4k, a fork of the beloved Newspaper3k. Extraction of articles, titles, and metadata from news websites.
MIT License

gnews with user agent returns empty text #582

Closed AndyTheFactory closed 8 months ago

AndyTheFactory commented 1 year ago

Issue by wj210 Wed Oct 18 03:50:22 2023 Originally opened as https://github.com/codelucas/newspaper/issues/976


I ran into errors when scraping Google News (gnews) links. The errors are along the lines of "Article `download()` failed with 403 Client Error: Max restarts limit reached for url" and "Article `download()` failed with 403 Client Error: Forbidden for url".

So I followed https://github.com/johnbumgarner/newspaper3_usage_overview and set the user-agent headers, but as soon as I do that, article.text comes back as an empty string.

The links are Google News RSS article links. Example: "https://news.google.com/rss/articles/CBMifWh0dHBzOi8vc2Vla2luZ2FscGhhLmNvbS9hcnRpY2xlLzE4NDM5MzItdGhlLWV4cGxhbmF0aW9uLWJlaGluZC1hcHBsZXMtZ3Jvc3MtbWFyZ2luLWRlY2xpbmUtYW5kLXdoeS10aGUtZnV0dXJlLWxvb2tzLWJyaWdodGVy0gEA?oc=5&hl=en-SG&gl=SG&ceid=SG:en"

The underlying link, "https://seekingalpha.com/article/1843932-the-explanation-behind-apples-gross-margin-decline-and-why-the-future-looks-brighter", works fine when scraped directly.
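
For the example link above, the publisher URL appears to be embedded in the last path segment as URL-safe base64, so one possible workaround is to recover it and pass that URL to the scraper instead of the Google News wrapper. The sketch below uses only the standard library; `decode_gnews_url` and `EXAMPLE` are hypothetical names for illustration, not part of newspaper or gnews. It assumes the older link format shown in this issue. Links that do not embed a readable URL will return `None`, and this is not necessarily how the library handles such links.

```python
import base64
import re
from urllib.parse import urlparse

# The Google News RSS link quoted in this issue.
EXAMPLE = (
    "https://news.google.com/rss/articles/"
    "CBMifWh0dHBzOi8vc2Vla2luZ2FscGhhLmNvbS9hcnRpY2xlLzE4NDM5MzItdGhlLWV4cGxh"
    "bmF0aW9uLWJlaGluZC1hcHBsZXMtZ3Jvc3MtbWFyZ2luLWRlY2xpbmUtYW5kLXdoeS10aGUt"
    "ZnV0dXJlLWxvb2tzLWJyaWdodGVy0gEA"
    "?oc=5&hl=en-SG&gl=SG&ceid=SG:en"
)


def decode_gnews_url(gnews_url):
    """Return the publisher URL embedded in the link, or None if none is found.

    Hypothetical helper: assumes the last path segment is URL-safe base64
    that contains the target URL as plain ASCII bytes.
    """
    # Take the last path segment; the query string is not part of .path.
    token = urlparse(gnews_url).path.rsplit("/", 1)[-1]
    # Restore any base64 padding that was stripped from the segment.
    padded = token + "=" * (-len(token) % 4)
    try:
        raw = base64.urlsafe_b64decode(padded)
    except ValueError:
        return None
    # Pick out the first run of printable ASCII that starts with http(s)://.
    match = re.search(rb"https?://[\x21-\x7e]+", raw)
    return match.group(0).decode("ascii") if match else None


if __name__ == "__main__":
    print(decode_gnews_url(EXAMPLE))
```

The returned string can then be given to `Article(...)` in place of the news.google.com link, which avoids downloading the Google redirect page at all.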

AndyTheFactory commented 8 months ago

Added GNews integration for better handling.