alan-turing-institute / misinformation-crawler

Web crawler to collect snapshots of articles to web archive
MIT License

alternet.org denying access to paged content #127

Open jemrobinson opened 5 years ago

jemrobinson commented 5 years ago

I'm seeing some strange issues with alternet.org. If I go directly to https://www.alternet.org/category/news-politics/page/2/, I get a 403 Forbidden error (both in the crawler and interactively), but if I reach the same page by following a link, it works.

This could be a Cloudflare issue. Needs investigation.

martintoreilly commented 5 years ago

I can successfully view the article list page at https://www.alternet.org/category/news-politics/page/2/ while connected to the St Pancras Wifi, though there was a GDPR pop-up.

edwardchalstrey1 commented 5 years ago

Looks like there's just an issue with the website itself? When I go directly to the site homepage (https://www.alternet.org/) in a browser, click Politics, and then click the page number links, they all return a 403. The popup is already handled.

edwardchalstrey1 commented 5 years ago

On further investigation, I don't have any issue accessing the site from another IP address that hasn't used the crawler, so the 403 error appears to be an anti-scraping measure. To potentially get around this, we could use IP rotation (with proxies) in combination with rotating user agents. A rotating_proxies middleware already exists to do this, but it needs a list of proxies.
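
To make the rotation idea concrete, here is a rough sketch of a downloader middleware that picks a proxy and a user agent per request. It assumes the crawler is Scrapy-based and relies on Scrapy's built-in HttpProxyMiddleware honouring `request.meta["proxy"]`. The class name `RotatingIdentityMiddleware` and its constructor arguments are hypothetical, and this is not the API of the existing rotating_proxies package, which also needs a proxy list but handles ban detection itself.

```python
import itertools
import random


class RotatingIdentityMiddleware:
    """Hypothetical downloader middleware: new proxy and User-Agent per request."""

    def __init__(self, proxies, user_agents, rng=None):
        if not proxies or not user_agents:
            raise ValueError("need at least one proxy and one user agent")
        # Round-robin over proxies so load is spread evenly across them.
        self._proxies = itertools.cycle(proxies)
        self._user_agents = list(user_agents)
        # Injectable RNG keeps the behaviour reproducible in tests.
        self._rng = rng or random.Random()

    def process_request(self, request, spider):
        # Scrapy's HttpProxyMiddleware sends the request via meta["proxy"].
        request.meta["proxy"] = next(self._proxies)
        # Overwrite any default User-Agent with a randomly chosen one.
        request.headers["User-Agent"] = self._rng.choice(self._user_agents)
        # Returning None tells Scrapy to continue processing the request.
        return None
```

To try it, the class would be registered in `DOWNLOADER_MIDDLEWARES` with a priority below 750 so it runs before HttpProxyMiddleware, and it would need a `from_crawler` hook (not shown) to load the proxy and user agent lists from settings. Whether this is enough to avoid the 403 is untested.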