ecprice / newsdiffs

Automatic scraper that tracks changes in news articles over time.

Adding header information in http requests to avoid 403 errors #57

Closed: msbt closed this issue 6 years ago

msbt commented 6 years ago

I'm trying to scrape various pages, but some of them can't be accessed; they seem to be blocking non-browser requests. I came across this snippet, but I don't know how or where to put it: https://stackoverflow.com/questions/13303449/urllib2-httperror-http-error-403-forbidden

Any pointers would be appreciated!

Best regards

msbt commented 6 years ago

Found the solution: simply add

if "sitename" in url:
    opener.addheaders = ...

in baseparser.py
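
For anyone landing here later, a minimal sketch of what that check could look like. It is written against Python 3's `urllib.request` so it runs standalone; the linked Stack Overflow question uses `urllib2`, where `build_opener` and `addheaders` work the same way. The site name, the header values, and the helper `make_opener` are all hypothetical placeholders, not taken from `baseparser.py`.

```python
import urllib.request

# Hypothetical values for illustration; neither appears in the thread above.
BLOCKING_SITE = "example.com"
BROWSER_HEADERS = [
    ("User-Agent", "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                   "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"),
    ("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"),
]


def make_opener(url):
    """Return an opener that sends browser-like headers only for the blocking site."""
    opener = urllib.request.build_opener()
    if BLOCKING_SITE in url:
        # addheaders is a list of (name, value) tuples attached to every
        # request made through this opener; assigning replaces the default
        # "Python-urllib/x.y" User-agent entry.
        opener.addheaders = list(BROWSER_HEADERS)
    return opener
```

Fetching would then be something like `make_opener(url).open(url, timeout=5).read()`. Whether a browser-like `User-Agent` is enough depends on the site; some also check other headers or cookies.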