fhamborg / news-please

news-please - an integrated web crawler and information extractor for news that just works
Apache License 2.0

Implement user agent functionality similar to newspaper3k #255

Closed GiridharRNair closed 9 months ago

GiridharRNair commented 10 months ago

Mandatory

Describe your question

I am developing an application that aggregates information about healthcare business strategy. When parsing articles with news-please, some sites reject the requests with HTTP 403 (Forbidden) responses. After investigating, I found that sending a browser-like User-Agent header may avoid these errors. However, I could not find a direct way to set a custom user agent in news-please.

Versions (please complete the following information):

Intent (optional; we'll use this info to prioritize upcoming tasks to work on)

In newspaper3k, the user agent can be set as follows:

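from newspaper import Article, Config
from newspaper.article import ArticleException

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'

# Shared configuration: custom user agent and a 10-second request timeout.
config = Config()
config.browser_user_agent = USER_AGENT
config.request_timeout = 10

RETRY_ATTEMPTS = 3

def parse_article(url):
    """Download and parse an article, retrying up to RETRY_ATTEMPTS times."""
    for attempt in range(RETRY_ATTEMPTS):
        try:
            # Pass the config explicitly; otherwise the custom user agent is not used.
            article = Article(url, config=config)
            article.download()
            # parse() raises ArticleException if the download failed (e.g. on a 403).
            article.parse()
            return article
        except ArticleException as e:
            print(f"Error retrieving article from URL '{url}': {e} ({attempt + 1}/{RETRY_ATTEMPTS})")
    return None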

I suggest implementing a similar feature in news-please so that users can set a custom user agent. This would help in cases where websites reject requests that lack a browser-like user agent and respond with 403 errors.

Additionally, if there is already a way to set a custom user agent in news-please that I am not aware of, could you please document it in the README to avoid confusion among users?
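In the meantime, one possible workaround is to download the page myself with a custom User-Agent header and hand the HTML to news-please. The following is only a minimal sketch under that assumption: the helper names `build_request`, `fetch_html`, and `parse_with_news_please` are hypothetical, the download uses the standard library's `urllib.request`, and the last step relies on `NewsPlease.from_html` from library mode. I have not verified that this avoids the 403 responses on every site.

```python
import urllib.request

USER_AGENT = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) "
              "Gecko/20100101 Firefox/78.0")


def build_request(url, user_agent=USER_AGENT):
    """Return a urllib Request that carries a custom User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, user_agent=USER_AGENT, timeout=10):
    """Download the page with the custom User-Agent and return decoded HTML."""
    request = build_request(url, user_agent)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def parse_with_news_please(url):
    """Fetch the HTML ourselves, then let news-please extract the article."""
    # Imported lazily so the helpers above work without news-please installed.
    from newsplease import NewsPlease
    return NewsPlease.from_html(fetch_html(url), url=url)
```

With this, `parse_with_news_please(url)` would stand in for `NewsPlease.from_url(url)`. A built-in option would still be preferable, since this workaround bypasses whatever download handling news-please performs on its own.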

Thank you for considering this enhancement.