Open jemrobinson opened 5 years ago
I can successfully view an article list page by visiting https://www.alternet.org/category/news-politics/page/2/ while connected to the St Pancras Wi-Fi, though there was a GDPR pop-up.
Looks like this is just an issue with the website itself? When I go directly to the site homepage (https://www.alternet.org/) in a browser, click Politics, and then follow the page number links, they all return a 403. The pop-up is already handled.
On further investigation, I can access the site without any issue from another IP address that hasn't been used by the crawler, so the 403 error appears to be an anti-scraping measure. To get around this, we could use IP rotation (with proxies) in combination with rotating user agents. A rotating_proxies middleware already exists for the proxy part, but it needs a list of proxies.
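For reference, here is a rough sketch of what that combination could look like, assuming the crawler is a Scrapy project and the middleware in question is the one from the scrapy-rotating-proxies package. The proxy addresses, the user-agent strings, the `myproject.middlewares` path and the `RandomUserAgentMiddleware` class are all placeholders I made up for illustration, not anything that exists in our code yet.

```python
import random

# Hypothetical pool of user-agent strings; swap in real, current values.
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:120.0) Gecko/20100101 Firefox/120.0",
]


class RandomUserAgentMiddleware:
    """Downloader middleware sketch: pick a random User-Agent per request."""

    def __init__(self, user_agents=USER_AGENTS, rng=None):
        self.user_agents = list(user_agents)
        self.rng = rng or random.Random()

    def process_request(self, request, spider):
        # Overwrite the header; returning None lets Scrapy continue
        # processing the request through the remaining middlewares.
        request.headers["User-Agent"] = self.rng.choice(self.user_agents)
        return None


# --- settings.py fragment (assumes scrapy-rotating-proxies is installed) ---

# Placeholder proxies; the package needs a real list to rotate through.
ROTATING_PROXY_LIST = [
    "proxy1.example.com:8000",
    "proxy2.example.com:8031",
]

DOWNLOADER_MIDDLEWARES = {
    # Hypothetical module path for the class above; runs before the
    # built-in UserAgentMiddleware (priority 500).
    "myproject.middlewares.RandomUserAgentMiddleware": 400,
    "rotating_proxies.middlewares.RotatingProxyMiddleware": 610,
    "rotating_proxies.middlewares.BanDetectionMiddleware": 620,
}
```

Whether this is enough to avoid the 403 is untested. If the block is keyed on something other than IP and user agent, we would need a different approach.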
I'm seeing some strange issues with alternet.org. If I go directly to https://www.alternet.org/category/news-politics/page/2/ I get a 403 Forbidden error (both in the crawler and interactively), but if I go there through a link it works. This could be a Cloudflare issue. Needs investigation.