This runs over all the pages on the site, saving them to a local cache directory, and then commits them to the mirror-data branch of this repository using everypoliticianbot's `with_git_repo` magic.
Files are saved under a SHA of the absolute URL with the session data stripped out - see `Mirror.sha_url` for the details. If a file with the same SHA as the URL already exists, the page is not saved again, though it is still visited.
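The cache key works roughly like this - a minimal sketch only, assuming the session data lives in query-string parameters. The `_piref` parameter prefix below is a guess for illustration, not taken from the real `Mirror.sha_url`:

```ruby
require "digest"
require "uri"

# Hypothetical sketch of Mirror.sha_url: drop session parameters from the
# query string, then hash the remaining absolute URL. The "_piref" prefix
# is an assumed session-parameter name, not taken from the real code.
def sha_url(url, session_prefixes: ["_piref"])
  uri = URI.parse(url)
  if uri.query
    kept = URI.decode_www_form(uri.query).reject do |key, _value|
      session_prefixes.any? { |prefix| key.start_with?(prefix) }
    end
    uri.query = kept.empty? ? nil : URI.encode_www_form(kept)
  end
  Digest::SHA1.hexdigest(uri.to_s)
end
```

Two URLs that differ only in their session parameters then map to the same cache file.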
Once it has done a complete sweep of all the pages it commits them; if an exception is raised part-way through, it commits whatever it has saved so far.
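That "commit even on exception" behaviour boils down to an `ensure` block. This is a sketch of the shape, not the actual implementation - the commit step is passed in as a block so the sketch runs standalone, whereas in the real scraper that step is everypoliticianbot's `with_git_repo` call:

```ruby
# Sketch of the sweep-then-commit flow: save every page, and whether or not
# the sweep finishes cleanly, run the commit step with whatever was saved.
# The :broken URL is just a stand-in to simulate a mid-sweep failure.
def sweep_and_commit(urls)
  saved = []
  urls.each do |url|
    raise "fetch failed" if url == :broken  # simulate a mid-sweep error
    saved << url
  end
  saved
ensure
  yield saved  # commit step: runs on success and on exception alike
end
```

On a clean run the block sees every page; on a failing run it still sees the partial list, and the exception then propagates as before.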
I've left in all the old code that scraped and parsed the existing site. Most of it could probably be deleted, but the CSS/XPath bits at least might still be useful.
Improvements that could be made:
[ ] Commit as we go, so that each file is committed as soon as it's been saved. I've not done this because the way everypoliticianbot currently works doesn't allow it - if you move the `with_git_repo` wrapper around the file writing then it falls over.
[ ] Better checks that the pages it's getting are the pages we expect. Sometimes the site responds with error pages that still have a 200 status code.
[ ] Rescrape existing files. If a file with the same SHA as the URL exists, it currently moves on to the next page, which means we won't pick up changes at the moment.
[ ] Maybe a mode where it only visits a page if it doesn't exist in the cache. I'm not sure how well this would work with Capybara.
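For the second improvement above (error pages served with a 200 status), one cheap approach is a content check after each fetch. This is only a sketch: both the expected marker and the error markers below are illustrative placeholders, not strings taken from the real site.

```ruby
# Heuristic sanity check: a 200 response only counts as a real page if the
# body contains something we expect and none of the known error markers.
# Both marker lists here are illustrative placeholders.
EXPECTED_MARKER = "Diputados"
ERROR_MARKERS = [/página no disponible/i, /ha ocurrido un error/i]

def looks_like_real_page?(html)
  html.include?(EXPECTED_MARKER) && ERROR_MARKERS.none? { |re| html.match?(re) }
end
```

Pages that fail the check could be retried or logged rather than cached, so a transient error page never poisons the mirror.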
The actual scraping works by visiting every person page listed on the all-people page of the site (http://www.congreso.es/portal/page/portal/Congreso/Congreso/Diputados/DiputadosTodasLegislaturas), from there visiting the page that lists all the terms that person was elected to, and then visiting all the person pages linked from there.
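That traversal (all-people page → each person's terms page → the person pages for each term) is just a link-following crawl. Here is a sketch with the link structure stubbed out as a Hash so it runs standalone; in the real scraper each step would be a Capybara visit against congreso.es, with the links scraped from the live page.

```ruby
# Generic breadth-first crawl over a page → links map. The Hash argument
# stands in for fetching a page and scraping its links; every page is
# visited exactly once even if it's linked from several places.
def crawl(start_url, links)
  visited = []
  queue = [start_url]
  until queue.empty?
    url = queue.shift
    next if visited.include?(url)
    visited << url
    queue.concat(links.fetch(url, []))
  end
  visited
end
```

The returned list is the visit order, which makes the traversal easy to check before pointing it at the live site.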