everypolitician-scrapers / spain_congreso_es

Details of members of the Spanish Congress from the official website congreso.es
https://morph.io/everypolitician-scrapers/spain_congreso_es

Don't hard-code the session ID into the URL #2

Closed tmtmtmtm closed 8 years ago

tmtmtmtm commented 8 years ago

One of the things that's causing this to fail so often is that the '_piref73_1333056_73_1333049_1333049' part of the URL appears to be a session ID.

I think it might be better to start at http://www.congreso.es/portal/page/portal/Congreso/Congreso/Diputados/DiputadosTodasLegislaturas and iterate through everyone one by one.

URLs of the form 'http://www.congreso.es/portal/page/portal/Congreso/Congreso/Diputados/BusqForm?next_page=/wc/fichaDiputado&idDiputado=%s&idLegislatura=%s' appear to work consistently for individual members, without a session ID. The page that links to all the historic terms, however, seems to need to be 'followed' with the session ID included.
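As a rough illustration of the session-free URL pattern, something like the following could build the per-member links. The names `MEMBER_URL` and `member_url` are made up for this sketch, and the example IDs are placeholders rather than known-good values.

```python
# Session-free URL template, copied from the pattern described above.
MEMBER_URL = (
    "http://www.congreso.es/portal/page/portal/Congreso/Congreso/Diputados/BusqForm"
    "?next_page=/wc/fichaDiputado&idDiputado=%s&idLegislatura=%s"
)


def member_url(id_diputado, id_legislatura):
    """Return the page URL for one person/term pair (hypothetical helper)."""
    return MEMBER_URL % (id_diputado, id_legislatura)


# Placeholder IDs, only to show the shape of the result.
print(member_url(1, 10))
```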

I'd probably break this up as:

  1. Scrape the 'DiputadosTodasLegislaturas' page, and write out a list of all the person/term pairs to a memberships table (unless that data already exists, or a RESCRAPE_ALL environment variable is set)
  2. For each entry in the memberships table, scrape that person/term page to the data table (unless it already exists, or a RESCRAPE_TERM=N environment variable is set). Then scrape that person's other historic memberships and add them to the memberships table to be scraped too.

Even if this fails to run to completion at any stage, it should be strictly additive on new runs. Once it's gathered all the historic data we can tweak it slightly to only pick up new data on future runs.
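The two steps above might look roughly like this. It is only a sketch, under several assumptions: the table layout, the single `name` column, and the `scrape_index` / `scrape_member` callables are all hypothetical stand-ins for the real page-parsing code, and `RESCRAPE_ALL` / `RESCRAPE_TERM` are read as described in the list. Every write is an `INSERT OR IGNORE` or `INSERT OR REPLACE`, so an interrupted run should leave the tables in a state the next run can simply continue from.

```python
import os
import sqlite3


def init_db(conn):
    # Both tables are keyed on the person/term pair (assumed schema).
    conn.execute(
        "CREATE TABLE IF NOT EXISTS memberships ("
        "person_id TEXT, term TEXT, PRIMARY KEY (person_id, term))"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS data ("
        "person_id TEXT, term TEXT, name TEXT, PRIMARY KEY (person_id, term))"
    )


def add_memberships(conn, pairs):
    # Strictly additive: pairs that are already queued are left alone.
    conn.executemany(
        "INSERT OR IGNORE INTO memberships (person_id, term) VALUES (?, ?)",
        [(str(p), str(t)) for p, t in pairs],
    )
    conn.commit()


def pending(conn, rescrape_term=None):
    # Memberships with no data row yet, plus everything in the forced term.
    return conn.execute(
        "SELECT m.person_id, m.term FROM memberships m "
        "LEFT JOIN data d ON d.person_id = m.person_id AND d.term = m.term "
        "WHERE d.person_id IS NULL OR m.term = ? "
        "ORDER BY m.term, m.person_id",
        (rescrape_term,),
    ).fetchall()


def run(conn, scrape_index, scrape_member, env=os.environ):
    """scrape_index() -> iterable of (person_id, term) pairs.
    scrape_member(person_id, term) -> (record dict, other (person_id, term) pairs).
    Both callables are hypothetical placeholders for the real scraping code."""
    init_db(conn)

    # Step 1: seed the memberships table, unless it is already populated.
    seeded = conn.execute("SELECT 1 FROM memberships LIMIT 1").fetchone()
    if seeded is None or env.get("RESCRAPE_ALL"):
        add_memberships(conn, scrape_index())

    # Step 2: work through the queue, which grows as historic terms turn up.
    rescrape_term = env.get("RESCRAPE_TERM")
    done = set()  # stops a forced term from being re-scraped forever
    while True:
        todo = [row for row in pending(conn, rescrape_term) if row not in done]
        if not todo:
            break
        for person_id, term in todo:
            record, other_terms = scrape_member(person_id, term)
            conn.execute(
                "INSERT OR REPLACE INTO data (person_id, term, name) VALUES (?, ?, ?)",
                (person_id, term, record["name"]),
            )
            add_memberships(conn, other_terms)  # also commits the data row
            done.add((person_id, term))
```

Passing the two scraping functions in as arguments keeps the queueing logic testable without hitting congreso.es; in the actual scraper they would wrap the HTTP fetch and HTML parsing.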

The only thing that seems slightly tricky is finding a suitable overarching ID for a member. I haven't looked deeply enough into that yet to see if there's anything that could be derived from their most recent membership page. If there is nothing, then we might need to tweak the order of step 2 slightly to scrape the historic memberships page first, to find their earliest known ID (which would presumably remain consistent once we're in a historic term).