everypolitician / everypolitician-data

data for national legislatures worldwide
http://everypolitician.org/

Austria #124

Open tmtmtmtm opened 9 years ago

tmtmtmtm commented 9 years ago

http://www.parlament.gv.at/WWER/NR/ABG/ — everyone since 1920

tmtmtmtm commented 9 years ago

This has a separate search by gender, though it doesn't seem to include gender information directly in the individual MP pages.

briatte commented 9 years ago

This part of my scraper tells you something about the number of party transitions that occurred under Haider (basically, many MPs who followed Haider in the BZÖ were marked as 'unaffiliated' for some time before that).

tmtmtmtm commented 9 years ago

Amtsgeheimnis.at have recently scraped this, and have offered to provide their data.

milh0use commented 9 years ago

I've written a scraper here: https://github.com/milh0use/austrian-parliament/blob/master/scraper.pl

Connected it to morph.io here: https://morph.io/milh0use/austrian-parliament

I've scraped all the data on the Austrian Parliament website, which includes every member in the history of modern Austria. Current members are where term = XXV. Some questions:

tmtmtmtm commented 9 years ago

Wooohooo! That's superb @milh0use

Is it correct to have one row per person per parliamentary session?

Mostly, yes, though there should be one row for each membership within the session. For example, if they change party/group in the middle, there should be a row for each. (See the notes at http://everypolitician.org/submitting.html)

The id field is not unique in the output dataset because many people served several terms. Is this a problem?

No, not at all: that's the common scenario, and everything should just work fine. When you run the scraper multiple times, take care not to duplicate rows, probably by using a composite key. That would be id and term if there is only one membership row per term, or id, term, party, and start_date if you're recording group changes. (The start_date is needed in case they transition from party A to B and then back to A within the same term.)
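A minimal sketch of the composite-key idea, in Python rather than the scraper's Perl. The table name `data`, the `save_membership` helper, and the sample rows are all made up for illustration; only the key columns come from the description above. It assumes an empty string stands in for a missing start_date, because SQLite treats NULLs in a key as distinct from one another.

```python
import sqlite3

# Hypothetical schema: the table name and the extra "name" column are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE data (
        id TEXT NOT NULL,
        term TEXT NOT NULL,
        party TEXT NOT NULL,
        start_date TEXT NOT NULL DEFAULT '',  -- '' rather than NULL so the key stays unique
        name TEXT,
        PRIMARY KEY (id, term, party, start_date)
    )
""")

def save_membership(conn, row):
    """Insert a membership row, overwriting any row with the same composite key."""
    conn.execute(
        "INSERT OR REPLACE INTO data (id, term, party, start_date, name) "
        "VALUES (:id, :term, :party, :start_date, :name)",
        row,
    )

# Made-up example: one person moving from party A to B and back to A in one term.
rows = [
    {"id": "P1", "term": "XXV", "party": "A", "start_date": "", "name": "Example Person"},
    {"id": "P1", "term": "XXV", "party": "B", "start_date": "2014-01-01", "name": "Example Person"},
    {"id": "P1", "term": "XXV", "party": "A", "start_date": "2014-06-01", "name": "Example Person"},
]

# Simulate running the scraper twice over the same source data.
for _ in range(2):
    for row in rows:
        save_membership(conn, row)
conn.commit()

row_count = conn.execute("SELECT COUNT(*) FROM data").fetchone()[0]
```

Without start_date in the key, the third row would overwrite the first, which is the A to B to A case mentioned above.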

I could probably extract the start and end dates of a person's membership of the parliament though this will usually correspond to the start and end date of the parliamentary session?

The preferred approach here is to also create a 'terms' table, with an id, name, start_date, and end_date for each term, and then add a start_date or end_date to the data table only where the dates of a person's membership differ from the term's. Either way, we'll need a terms table to be able to import the data. (We could construct it by hand locally, but I'm presuming it's easier for you to spit it out whilst parsing.)
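Roughly, the "only where the dates differ" rule could look like the sketch below. This is Python for illustration, not the scraper's actual code; the `TERMS` dictionary, the `membership_row` helper, and the term id and dates are placeholders, not real data.

```python
# Hypothetical terms table: id, name, start_date, end_date for each term.
TERMS = {
    "T1": {
        "id": "T1",
        "name": "Example term",
        "start_date": "2000-01-01",
        "end_date": "2004-12-31",
    },
}

def membership_row(person_id, term_id, joined, left):
    """Build a data-table row, keeping dates only where they differ from the term's."""
    term = TERMS[term_id]
    row = {"id": person_id, "term": term_id}
    if joined and joined != term["start_date"]:
        row["start_date"] = joined
    if left and left != term["end_date"]:
        row["end_date"] = left
    return row

# Served the whole term: no dates needed on the membership row.
full_term = membership_row("P1", "T1", "2000-01-01", "2004-12-31")

# Joined part-way through: only start_date is recorded.
late_joiner = membership_row("P2", "T1", "2002-03-15", "2004-12-31")
```

Rows without dates then inherit them from the terms table at import time.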

Thanks for this — it's looking really great!

milh0use commented 9 years ago

So I've made some progress (which I'll push to GitHub soon).

I've fixed the SQLite database connection to write the columns in a more sensible order. (The documentation for the Perl Database::DumpTruck module was a bit sparse, so I had to read through the source code to work out how to control the column order.)

I've created a composite key of PersonID-Term to test whether it's sufficient to accommodate the data and give a unique ID per person per term. It's fallen over in one place so far, where a person changed name mid-session and so has two entries on the Austrian Parliament website in a single term. How has this sort of thing been addressed for other parliaments? The finer details are getting a bit messy. I'm not confident, for example, that a change of name will have been indicated consistently in the source data. Are you willing to tolerate some level of error? Or would you try to handle this sort of thing in the scraper?

Here's the entry for the member in question: http://www.parlament.gv.at/WWER/PAD_21150/index.shtml

The "(bis 18.8.2012: Mag. Susanne Neuwirth)" note means "until 18/8/2012: Mag. Susanne Neuwirth".

I've had to scrape these individuals' pages for the photo, but I'm scraping the list of members from this page: http://www.parlament.gv.at/WWER/BR/MITGL/index.shtml?xdocumentUri=%2FWWER%2FBR%2FMITGL%2Findex.shtml&anwenden=Anwenden&BL=ALLE_BL&STEP=&FR=ALLE&NRBR=BR&FBEZ=FW_007&jsMode=&LISTE=&requestId=FE796F036A&letter=&WP=ALLE&listeId=7&R_WF=WP&GP=XXII&M=&W=W

There she has two entries (under two different names) linking to the same individual page. Maybe including start_date in the composite ID would do the trick, though it still means scraping out the date of the change.

milh0use commented 9 years ago

Oh, I've sorted out a terms table in the SQLite database too...

tmtmtmtm commented 9 years ago

@milh0use sorry for the delay on this — it's been a crazy few days!

As you may have seen we've put up the current data, which was supplied to us in a spreadsheet by a group working there — but I'm very keen to get that replaced by your much fuller data!

In terms of people who have changed names, this is a bit of a limitation of the current simplified approach we're taking of just generating 'flat' tabular data. The underlying data format we transform everything into is able to cope with that just fine, but we don't have a sensible way of dealing with it here yet. As there's only one person, with one previous name, I'd suggest just writing that out to separate columns of 'previous_name' and 'previous_name_end_date', and we'll work out at this end what to do with that. (We could create a whole new table for it, but that might be overkill.)
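Something like the following might be enough to pull those two columns out of the "(bis ...)" note. It's a sketch in Python rather than Perl, and it assumes the note always has the form "(bis D.M.YYYY: Name)" as in the one example above; the `previous_name_columns` helper is a made-up name, and writing the date out in ISO format is also an assumption.

```python
import re
from datetime import date

# Matches notes such as "(bis 18.8.2012: Mag. Susanne Neuwirth)".
# Assumes day.month.year order and a name running up to the closing bracket.
PREVIOUS_NAME = re.compile(r"\(bis\s+(\d{1,2})\.(\d{1,2})\.(\d{4}):\s*([^)]+)\)")

def previous_name_columns(text):
    """Return the previous_name and previous_name_end_date columns for a name note."""
    match = PREVIOUS_NAME.search(text)
    if match is None:
        # No name change noted: leave both columns empty.
        return {"previous_name": None, "previous_name_end_date": None}
    day, month, year, name = match.groups()
    return {
        "previous_name": name.strip(),
        "previous_name_end_date": date(int(year), int(month), int(day)).isoformat(),
    }

example = previous_name_columns("(bis 18.8.2012: Mag. Susanne Neuwirth)")
unchanged = previous_name_columns("no name note here")
```

The academic title ("Mag.") is kept as part of the previous name here; stripping it would be a separate decision.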