codelucas / newspaper

newspaper3k is a library for news, full-text, and article metadata extraction in Python 3. Advanced docs:
https://goo.gl/VX41yK
MIT License
14.09k stars, 2.11k forks

There seem to be complaints related to the user agent scraping permission issue #1002

Closed sutgeorge closed 3 months ago

sutgeorge commented 5 months ago

Hello,

Quite a few people seem to have opened issues similar to this one. I solved my problem with the user agent trick (for whatever reason I was not allowed to scrape the contents of a website, and article.html came back as basically an empty string).

Either way, I found out that the solution is to pass a Config object to the Article constructor, with browser_user_agent set to something like Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:108.0) Gecko/20100101 Firefox/108.0. I'm wondering whether this detail should be added to the main README.md file. I'm convinced it would be helpful and would save other people a lot of time.

Thank you.

rajitkhanna commented 3 months ago

Hi @sutgeorge , could you share your code?

sutgeorge commented 3 months ago

Sure @rajitkhanna, this is a snippet of the Jupyter Notebook that I used:

import newspaper
import tqdm
from newspaper import Article, Config
from bs4 import BeautifulSoup

# Send a browser-like User-Agent instead of newspaper's default one.
config = Config()
config.browser_user_agent = "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:108.0) Gecko/20100101 Firefox/108.0"

# section and page_number are defined elsewhere in the notebook (not shown here).
url = 'https://www.capital.ro/{}/page/{}'.format(section, page_number)
page = Article(url, language='ro', config=config)
page.download()

...

sutgeorge commented 3 months ago

Obviously, you can replace the URL with anything you'd like (I wanted to scrape the page containing a list of articles from a news publication).
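
For anyone landing here later, the same idea can be wrapped in a small reusable helper. This is only a sketch under assumptions: build_listing_url and fetch_html are hypothetical names that are not part of newspaper, and the check for an empty page.html simply mirrors the symptom described at the top of this issue.

```python
# Browser-like User-Agent string, copied from the snippet above.
BROWSER_USER_AGENT = (
    "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:108.0) "
    "Gecko/20100101 Firefox/108.0"
)


def build_listing_url(section, page_number):
    """Hypothetical helper: build the URL of one article-listing page."""
    return "https://www.capital.ro/{}/page/{}".format(section, page_number)


def fetch_html(url, language="ro", user_agent=BROWSER_USER_AGENT):
    """Hypothetical helper: download a page with a custom User-Agent."""
    # Imported lazily so build_listing_url works without newspaper installed.
    from newspaper import Article, Config

    config = Config()
    config.browser_user_agent = user_agent

    page = Article(url, language=language, config=config)
    page.download()

    # An empty string here is the symptom reported in this issue.
    if not page.html:
        raise RuntimeError("Empty HTML returned for {}".format(url))
    return page.html
```

Calling fetch_html(build_listing_url(section, page_number)) with your own section and page number should return the raw HTML, which can then be handed to BeautifulSoup. Whether a given site accepts this particular User-Agent string is not guaranteed.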