everypolitician-scrapers / spain_congreso_es

Details of members of the Spanish Congress from the official website congreso.es
https://morph.io/everypolitician-scrapers/spain_congreso_es

Mirror pages to github #10

Closed: struan closed this 7 years ago

struan commented 8 years ago

This runs over all the pages on the site, saves them to a local cache directory, and then uses everypoliticianbot's with_git_repo magic to commit them to the mirror-data branch of this repository.

Each file is named after a SHA of the page's absolute URL with the session data stripped out; see Mirror.sha_url for details. If a file with that name already exists, the page is not saved again, although it is still visited.
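To illustrate the naming scheme, here is a minimal sketch in Python (used purely for illustration; it is not the code in Mirror.sha_url). It assumes the session data lives in a query parameter, here given the hypothetical name `sessionid`, and that the SHA is SHA-1; neither detail is confirmed by this description.

```python
import hashlib
from pathlib import Path
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical: query parameters assumed to carry session data.
SESSION_PARAMS = {"sessionid"}


def strip_session(url):
    """Return the URL without session parameters (and without any fragment)."""
    parts = urlsplit(url)
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(query), ""))


def sha_url(url):
    """Hex digest used as the cache file name (SHA-1 is an assumption)."""
    return hashlib.sha1(strip_session(url).encode("utf-8")).hexdigest()


def save_page(cache_dir, url, body):
    """Write the page unless a file with the same SHA name already exists.

    Returns True if the page was written, False if it was skipped.
    """
    path = Path(cache_dir) / sha_url(url)
    if path.exists():
        return False
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(body, encoding="utf-8")
    return True
```

Because the session parameter is dropped before hashing, two visits to the same page in different sessions map to the same file name, which is what makes the "skip if it already exists" check work.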

The actual scraping starts from the all people page of the site (http://www.congreso.es/portal/page/portal/Congreso/Congreso/Diputados/DiputadosTodasLegislaturas) and visits every person page listed there. From each person page it visits the page listing all the terms that person was elected to, and then visits every person page linked from that list.
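The traversal order described above can be sketched roughly as follows. This is an illustrative Python outline, not the scraper's own code: `fetch` and the three link-extraction callables are hypothetical stand-ins for the real HTTP and CSS/XPath logic, and the `seen` set that avoids revisiting a URL within one run is an assumption.

```python
def crawl(index_url, fetch, person_links, terms_link, term_person_links):
    """Visit index -> person pages -> terms page -> per-term person pages.

    All four callables are hypothetical:
      fetch(url)               -> parsed page
      person_links(page)       -> person URLs listed on the index page
      terms_link(page)         -> URL of the person's terms list, or None
      term_person_links(page)  -> person URLs linked from the terms list
    Returns the URLs in the order they were visited.
    """
    visited = []
    seen = set()

    def visit(url):
        # Assumption: each URL is fetched at most once per run.
        if url in seen:
            return None
        seen.add(url)
        visited.append(url)
        return fetch(url)

    index_page = visit(index_url)
    for person_url in person_links(index_page):
        person_page = visit(person_url)
        if person_page is None:
            continue
        terms_url = terms_link(person_page)
        if terms_url is None:
            continue
        terms_page = visit(terms_url)
        if terms_page is None:
            continue
        for term_person_url in term_person_links(terms_page):
            visit(term_person_url)
    return visited
```

Passing the fetcher and extractors in as arguments keeps the traversal separate from the site-specific parsing, so the order can be checked against a small fake site without any network access.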

Once it has completed a sweep of all the pages, it commits them; if an exception is raised partway through, it commits whatever it has saved so far.
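The "commit what it has" behaviour amounts to a try/finally around the sweep. The Python sketch below is illustrative only: `fetch_and_save` and `commit` are hypothetical callables standing in for the page mirroring and for the with_git_repo commit step, whose real interface is not shown here. Re-raising the exception after committing is also an assumption.

```python
def mirror(urls, fetch_and_save, commit):
    """Save each page, then commit; commit partial results on failure.

    fetch_and_save(url) and commit(saved_urls) are hypothetical callables.
    """
    saved = []
    try:
        for url in urls:
            fetch_and_save(url)
            saved.append(url)
    finally:
        # Runs after a complete sweep and also when an exception
        # interrupts it, so partial progress is still committed.
        commit(saved)
    return saved
```

With this shape, a failure on one page does not discard the pages already mirrored, and the exception still propagates so the run is reported as failed.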

I've left in all the old code that scraped and parsed the existing site. It might be easier to delete most of it, but at least the CSS/XPath bits might be useful.

Improvements that could be made: