AndyTheFactory / newspaper4k

📰 Newspaper4k a fork of the beloved Newspaper3k. Extraction of articles, titles, and metadata from news websites.
MIT License

Cloudflare Issue with CRHOY.com #647

Open gabrielgq opened 2 months ago

gabrielgq commented 2 months ago

CRHOY:

This is a Cloudflare issue, so I don't know if this is the right place to post, but if anyone can help I'd be very thankful.

crhoy.com

Some sample URLs that I have tried

crhoy.com/economia/estas-son-las-razones-por-las-que-sugef-recomienda-destituir-a-presidente-del-popular
crhoy.com/economia/empresarios-piden-avanzar-en-proyectos-para-mejorar-la-competitividad

The exact code I used to test these articles/website


import newspaper

user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'
config = newspaper.configuration.Configuration()
config.browser_user_agent = user_agent

article = newspaper.article('https://www.crhoy.com/economia/estas-son-las-razones-por-las-que-sugef-recomienda-destituir-a-presidente-del-popular/', config=config)
print(article.text)

The site is protected by Cloudflare. I tried more complex methods with readability and Selenium, and even used 12ft.io and http://txtify.it, but none of them worked.

femdias commented 1 month ago

Hey Gabriel! I was having the same problem, and then I found out that the 0.9.3 update added cloudscraper support (see changelog). You can read the cloudscraper documentation here; it basically modifies requests to bypass Cloudflare. To use it in newspaper4k, you just have to install cloudscraper (pip install cloudscraper), since the code automatically uses it if it is installed.
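In case it is useful, here is a rough sketch of how one might confirm that cloudscraper is actually importable in the same environment before retrying the download. The helper names cloudscraper_available and fetch_article_text are made up for illustration and are not part of newspaper4k; the only library call used is newspaper.article, the same one as in the original snippet. Whether cloudscraper gets past the protection on a given site is not guaranteed.

```python
import importlib.util


def cloudscraper_available() -> bool:
    """Return True if the optional cloudscraper package can be imported.

    Hypothetical helper: it only checks that the package is installed in
    the current environment, not that newspaper4k ends up using it.
    """
    return importlib.util.find_spec("cloudscraper") is not None


def fetch_article_text(url: str) -> str:
    """Download and parse one article, returning its extracted text.

    Hypothetical helper wrapping newspaper.article(). newspaper is
    imported lazily so the check above can run even without it installed.
    """
    import newspaper

    article = newspaper.article(url)
    return article.text


# Example usage (performs a real network request, so it is left commented out):
# print("cloudscraper installed:", cloudscraper_available())
# print(fetch_article_text("https://www.crhoy.com/economia/estas-son-las-razones-por-las-que-sugef-recomienda-destituir-a-presidente-del-popular/"))
```

If cloudscraper_available() prints False, the package was probably installed into a different interpreter or virtual environment than the one running the script.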

Hope it helps!

gabrielgq commented 1 month ago

Thanks, I added cloudscraper but sadly it still doesn't work for the site I mentioned. Did the sample URLs work for you?