Closed sutgeorge closed 3 months ago
Hi @sutgeorge, could you share your code?
Sure @rajitkhanna, this is a snippet of the Jupyter Notebook that I used:
import newspaper
import tqdm
from newspaper import Article, Config
from bs4 import BeautifulSoup

# Send a browser-like User-Agent; without it, article.html came back basically empty.
config = Config()
config.browser_user_agent = "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:108.0) Gecko/20100101 Firefox/108.0"

# `section` and `page_number` are defined elsewhere in the notebook (not shown here).
url = 'https://www.capital.ro/{}/page/{}'.format(section, page_number)
page = Article(url, language='ro', config=config)
page.download()
...
Obviously, you can replace the URL with anything you'd like (I wanted to scrape the page containing a list of articles from a news publication).
Hello,
Quite a few people seem to have opened issues similar to this one. I solved my problem with the user agent trick (I was not allowed to scrape the contents of a website, for whatever reason, and the result of article.html was basically an empty string).

Either way, I found out that the solution is to pass a Config object as a parameter to the Article class, with browser_user_agent set to something like Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:108.0) Gecko/20100101 Firefox/108.0. I'm wondering whether this detail should be added to the main README.md file. I'm convinced it would be helpful and would save other people a lot of time.

Thank you.
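For anyone who wants to check whether the user agent really is what a site filters on, here is a minimal sketch that uses only the Python standard library. It is not newspaper's own code and makes no claim about how the library sends requests internally. The helper names build_request and probe are made up for this example, and the user agent string is the one quoted above.

```python
import urllib.error
import urllib.request

# User agent string quoted in the post above.
BROWSER_UA = (
    "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:108.0) "
    "Gecko/20100101 Firefox/108.0"
)


def build_request(url, user_agent=None):
    """Build a GET request, optionally carrying a browser-like User-Agent.

    Hypothetical helper for this sketch, not part of newspaper.
    """
    headers = {"User-Agent": user_agent} if user_agent else {}
    return urllib.request.Request(url, headers=headers)


def probe(url, user_agent=None, timeout=10):
    """Fetch the URL and return (HTTP status, body length in bytes).

    Hypothetical helper for this sketch. An HTTP error response is
    reported as its status code with a body length of 0.
    """
    request = build_request(url, user_agent)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status, len(response.read())
    except urllib.error.HTTPError as error:
        return error.code, 0
```

Calling probe(url) and then probe(url, BROWSER_UA) on the same page and comparing the two results can hint at whether the site treats the requests differently. When no user agent is passed, urllib sends its own default (Python-urllib/x.y), so the first call shows what a script without a browser-like user agent gets back.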