ArchiveTeam / grab-site

The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

Add support for Cloudflare DDoS protection screen #115

Open ivan opened 6 years ago

ivan commented 6 years ago

Let's run their JavaScript somehow, or import someone else's solution to the Cloudflare problem.

gabefair commented 4 years ago

@ivan Could you explain your proposed solution a bit further? I would like to pick this up.

ivan commented 4 years ago

My first step would be to review the results of https://github.com/search?q=cloudflare+scrape+fork%3Atrue and see whether anyone has solved the problem well enough that their code can just be imported by grab-site and called from one of the wpull hooks. https://github.com/Anorov/cloudflare-scrape/blob/master/cfscrape/__init__.py looks promising, but I haven't tested it.
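
For illustration only, here is a minimal sketch of what "importing someone else's solution" could look like with cfscrape. It assumes cfscrape's documented `get_tokens()` helper still works against current Cloudflare challenges, which is untested. The function names `cookie_header` and `solve_with_cfscrape` are hypothetical, and how the result would be wired into a wpull hook is deliberately left out.

```python
def cookie_header(tokens):
    """Join a {name: value} cookie dict into a single Cookie header value."""
    return "; ".join(f"{name}={value}" for name, value in tokens.items())


def solve_with_cfscrape(url):
    """Ask cfscrape for Cloudflare's clearance cookies for `url` (untested)."""
    # Imported lazily so the helper above works without the third-party package.
    import cfscrape  # pip install cfscrape

    # get_tokens() returns the cookie dict plus the User-Agent it was issued
    # to; per the cfscrape README, later requests should reuse that User-Agent.
    tokens, user_agent = cfscrape.get_tokens(url)
    return {"Cookie": cookie_header(tokens), "User-Agent": user_agent}
```

The returned headers would then have to be attached to the crawler's requests for that host; that step depends on the wpull hook API and is not shown.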

If none of the existing solutions is good enough, I would look at using https://github.com/microsoft/playwright to visit the Cloudflare-protected page, then copy the Cloudflare cookie out of the browser and add it to the grab-site cookie jar.
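
A rough sketch of that Playwright idea, under several assumptions: the challenge clears within a fixed wait (the 10 seconds here is a guess), the cookies are handed over as a Netscape-format `cookies.txt` written by Python's standard `http.cookiejar`, and the function names are hypothetical. Whether a cookies file is the right way to feed grab-site's cookie jar is not confirmed anywhere in this thread.

```python
from http.cookiejar import Cookie, MozillaCookieJar


def cookies_to_jar(browser_cookies):
    """Convert cookie dicts (the shape returned by Playwright's
    BrowserContext.cookies()) into a MozillaCookieJar."""
    jar = MozillaCookieJar()
    for c in browser_cookies:
        expires = c.get("expires", -1)
        # Playwright reports session cookies with expires == -1.
        session_only = expires is None or expires <= 0
        domain = c["domain"]
        jar.set_cookie(Cookie(
            version=0, name=c["name"], value=c["value"],
            port=None, port_specified=False,
            domain=domain, domain_specified=True,
            domain_initial_dot=domain.startswith("."),
            path=c.get("path", "/"), path_specified=True,
            secure=bool(c.get("secure", False)),
            expires=None if session_only else int(expires),
            discard=session_only,
            comment=None, comment_url=None, rest={},
        ))
    return jar


def fetch_browser_cookies(url, wait_ms=10_000):
    """Load `url` in headless Chromium and return (cookies, user_agent)."""
    # Imported lazily: needs `pip install playwright` and `playwright install`.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        context = browser.new_context()
        page = context.new_page()
        page.goto(url)
        page.wait_for_timeout(wait_ms)  # assumed long enough for the challenge
        cookies = context.cookies()
        user_agent = page.evaluate("navigator.userAgent")
        browser.close()
    return cookies, user_agent


def save_cookies_txt(browser_cookies, path):
    """Write the cookies to `path` in Netscape cookies.txt format."""
    # ignore_discard=True keeps session cookies, which save() otherwise skips.
    cookies_to_jar(browser_cookies).save(path, ignore_discard=True)
```

The crawler would probably also need to send the same User-Agent as the browser that obtained the cookie, which is why `fetch_browser_cookies` returns it alongside the cookies.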