create a prompt for comparing Bundestag members to de.wikipedia.org's list

everypolitician / democratic-commons-tasks

An issues-only repository for tracking work for the Democratic Commons project


create a prompt for comparing Bundestag members to de.wikipedia.org's list #1

Open mhl opened 6 years ago

mhl commented 6 years ago

We would like to be able to compare the current members of the Bundestag according to Wikidata with those on the German Wikipedia list, to see if it's up to date. We do this by creating a prompt (see https://tools.wmflabs.org/prompter ) which displays a comparison based on:

A CSV file from a morph.io-hosted scraper of https://de.wikipedia.org/wiki/Liste_der_Mitglieder_des_Deutschen_Bundestages_(19._Wahlperiode) (although it looks like this might already be done here: https://morph.io/tmtmtmtm/germany-bundestag-members-wikipedia )
A SPARQL query that lists all the current members of the Bundestag known in Wikidata
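As a rough sketch of the second input, the query text could be assembled like this. P39 (position held), P2937 (parliamentary term) and P582 (end time) are real Wikidata properties; the `members_query` name and the two item IDs passed in are placeholders, not values taken from this issue.

```ruby
# Build a SPARQL query for people holding a given position during a given term.
# The item IDs are arguments rather than hard-coded: look them up on Wikidata.
def members_query(position_id, term_id)
  <<~SPARQL
    SELECT ?item WHERE {
      ?item p:P39 ?statement .
      ?statement ps:P39 wd:#{position_id} ;
                 pq:P2937 wd:#{term_id} .
      FILTER NOT EXISTS { ?statement pq:P582 ?end }
    }
  SPARQL
end

# Hypothetical placeholder IDs, for illustration only.
puts members_query('Q_POSITION', 'Q_TERM')
```

The `FILTER NOT EXISTS` clause is one possible reading of "current": it drops memberships that already carry an end date.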

tmtmtmtm commented 6 years ago

although it looks like this might already be done here: https://morph.io/tmtmtmtm/germany-bundestag-members-wikipedia

That scraper doesn't currently cover the latest term. It might be as simple as extending the range from 1..18 to 1..19 but at a glance it looks as if the latest table uses a slightly different layout. We also don't currently extract the Wikidata IDs for constituencies or parties, which matter more for this new usage. The party IDs might, again, differ across terms (they aren't linked in this table, so they would need to be cross-referenced from an earlier lookup).

We could extend the existing scraper to cope with this, but I'm not sure that would buy us very much at this point. The two scrapers serve quite different requirements, so I'd be very tempted to simply make a new scraper just for the 19th Bundestag. That would also have the useful benefit of isolating layout changes to the current list from layout changes to the historic lists, so that the scraper breaking for one wouldn't cause the other (largely unrelated) use to fail too.

henare commented 6 years ago

It might be as simple as extending the range from 1..18 to 1..19 but at a glance it looks as if the latest table uses a slightly different layout.

Just confirming that indeed doesn't work:

$ be ruby scraper.rb 
19
Can't find Wikidata IDs for: Logo des Deutschen Bundestages in de
scraper.rb:33:in `block in <class:MemberRow>': undefined method `downcase' for nil:NilClass (NoMethodError)
    from /home/henare/.rvm/gems/ruby-2.3.3/gems/field_serializer-0.3.0/lib/field_serializer.rb:26:in `block in to_h'
    from /home/henare/.rvm/gems/ruby-2.3.3/gems/field_serializer-0.3.0/lib/field_serializer.rb:25:in `map'
    from /home/henare/.rvm/gems/ruby-2.3.3/gems/field_serializer-0.3.0/lib/field_serializer.rb:25:in `to_h'
    from scraper.rb:135:in `block in scrape_term'
    from scraper.rb:134:in `map'
    from scraper.rb:134:in `scrape_term'
    from scraper.rb:145:in `block in <main>'
    from scraper.rb:142:in `reverse_each'
    from scraper.rb:142:in `<main>'
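
The trace shows `downcase` being called on `nil` inside a `MemberRow` field, which is the usual symptom of a cell that the new layout no longer provides. The code at scraper.rb:33 isn't shown in this thread, so the following is only a sketch of the general guard, using a hypothetical `normalise_cell` helper. Ruby's safe navigation operator (`&.`, available since 2.3) returns `nil` instead of raising.

```ruby
# Hypothetical helper: tidy a cell's text, tolerating cells that are absent.
def normalise_cell(text)
  cleaned = text&.strip&.downcase
  (cleaned.nil? || cleaned.empty?) ? nil : cleaned
end

p normalise_cell('  CDU ')  # "cdu"
p normalise_cell(nil)       # nil
```

A guard like this only stops the crash; the row would still need the right selector for the new table layout.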

henare commented 6 years ago

Alright, so here's where I've got to so far:

I stripped back the scraper Tony uses as his go-to scraper for a template and updated the Ruby and dependencies to the latest versions :chart_with_upwards_trend: Then I added a basic scraper, again using a lot of the code from the template scraper :clipboard:

Next I wanted to see how that would look using the Prompt so I created one using the Wikidata template.

Because you need an API key to get morph CSVs I just downloaded the CSV from the morph scraper and put it in a gist to test.

I came up with the simplest possible SPARQL that just gets the Wikidata IDs of all members of the 19th Bundestag and then ran the compare :eyes:

The compare shows a handful of differences. I had a poke to try and understand what these differences are. I saw one that looks like a duplicate ID and another where there are two people with the same name (which one is right?) :busts_in_silhouette:
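
Both kinds of oddity mentioned here (a repeated ID, and one name shared by two people) can be surfaced mechanically before running the compare. A minimal sketch, assuming each scraped row is a Hash with `:id` and `:name` keys; the method names and the rows are made up for illustration and are not real results.

```ruby
# IDs that appear on more than one row.
def duplicate_ids(rows)
  rows.map { |row| row[:id] }
      .group_by(&:itself)
      .select { |_id, group| group.size > 1 }
      .keys
end

# Names that are attached to more than one distinct ID.
def shared_names(rows)
  rows.group_by { |row| row[:name] }
      .select { |_name, group| group.map { |row| row[:id] }.uniq.size > 1 }
      .keys
end

# Made-up sample rows, for illustration only.
rows = [
  { id: 'Q1', name: 'Example Person' },
  { id: 'Q2', name: 'Example Person' },
  { id: 'Q3', name: 'Another Person' },
  { id: 'Q3', name: 'Another Person' },
]

p duplicate_ids(rows)  # ["Q3"]
p shared_names(rows)   # ["Example Person"]
```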

It's the end of the day so I'm going to stop here. On Monday I need to work out what the next steps are. Should we be comparing more fields? What should we be doing about the discrepancies? Stay tuned! :bat:

tmtmtmtm commented 6 years ago

Woohoo!

To get at morph CSVs without exposing API keys, we already have a template for that :)

{{Morph CSV proxy URL
|scraper=everypolitician-scrapers/…
|query=SELECT id AS wikidata FROM data
}}

Resolving any obvious issues (like people who can be merged) can be a useful first step, but anything more complex we'll usually just pass back to in-country partners, so don't worry too much about working out what to do there — as long as the diff looks like it's finding problems correctly (as this one does ✨ ), fixing those problems becomes a separate issue.

[In cases like this, where some people come up with no name, it's sometimes worth also going in and adding English labels for them, just so it's more obvious what's going on. I've done that for some of the people here, but I've left a few if you want to try that out.]

The next step would indeed be to compare some other fields. P4100 (parliamentary group) and P768 (electoral district) are almost certainly the most useful at this stage.

henare commented 6 years ago

To get at morph CSVs without exposing API keys, we already have a template for that :)

Very nice :heart_eyes: I've switched my prompt to use that.

Resolving any obvious issues (like people who can be merged) can be a useful first step, but anything more complex we'll usually just pass back to in-country partners, so don't worry too much about working out what to do there — as long as the diff looks like it's finding problems correctly (as this one does :sparkles: ), fixing those problems becomes a separate issue.

To merge people, Tony told me about the Wikidata merge gadget. You first need to enable it in your preferences. It's the first item on your Preferences > Gadgets page. Then to use it click the More link on an item's page:

[Screenshot: screenshot-2017-11-20 bernd buchholz]

[In cases like this, where some people come up with no name, it's sometimes worth also going in and adding English labels for them, just so it's more obvious what's going on. I've done that for some of the people here, but I've left a few if you want to try that out.]

Thanks for leaving some for me to do :+1: I've now filled them all out and it's looking much cleaner.

The next step would indeed be to compare some other fields. P4100 (parliamentary group) and P768 (electoral district) are almost certainly the most useful at this stage.

Constituency was straightforward as the ID decorator had done its magic on the scraper.

Parliamentary group is a bit more complex because the source Wikipedia data doesn't have links so the decorator hasn't worked. I could add the Wikidata ID to the scraper by looking up the text we're getting from Wikipedia using the Wikidata API. That doesn't seem like the right thing to do because it feels like we're adding data to the Wikipedia page, rather than comparing what's there.

So instead I've compared the party text from the scraper with the short name of the Wikidata party attached to the person's "position held" statement for membership of the 19th Bundestag. The problem with this is that a couple of parties don't have short names and, again, adding them just for this doesn't seem like the right thing to do.

tmtmtmtm commented 6 years ago

Woot. This is coming along well.

Constituency was straightforward as the ID decorator had done its magic on the scraper.

I'm not sure this is as straightforward as you think: the large number of diffs here aren't simply because Wikipedia is missing lots of information. The people in Wikipedia who don't have any entry in the Wahlkreis column are list MPs, rather than constituency MPs.

Parliamentary group is a bit more complex because the source Wikipedia data doesn't have links so the decorator hasn't worked.

This is a little more indirect than if the entries in the membership table had had links, but the data is on the page, so the scraper does already know the IDs for the groups:

[Screenshot: screen shot 2017-11-20 at 07 18 26]

What I would do in a case like this is gather up those into a Hash of string → Wikidata ID, and then have your party_wikidata field look that up
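
A minimal sketch of that lookup in plain Ruby. The group names and IDs are placeholders rather than the real contents of the page, and the `GROUP_IDS` constant is a hypothetical name; only `party_wikidata` comes from the comment above.

```ruby
# Hypothetical lookup built from the linked group names found elsewhere on the page.
# The IDs below are placeholders, not real Wikidata items.
GROUP_IDS = {
  'Group A' => 'Q000001',
  'Group B' => 'Q000002',
}.freeze

# Resolve the unlinked text in a member's row to an ID, or nil if unknown.
def party_wikidata(party_text)
  GROUP_IDS[party_text.to_s.strip]
end

p party_wikidata(' Group A ')  # "Q000001"
p party_wikidata('Unknown')    # nil
```

Returning `nil` for an unknown string keeps unmatched rows visible in the comparison instead of silently guessing.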

tmtmtmtm commented 6 years ago

Ah, no, on closer investigation, there's still a mismatch here. The membership table lists the party, but those tables link to the faction. Looks like you'll actually need to use the Fraktionsvorstände table:

[Screenshot: screen shot 2017-11-20 at 07 39 04]

Then you can use the tiny coloured bars in each row (here and in the membership table) as your common lookup. So even more indirection than the current version, but hopefully should actually be slightly simpler, as you're not relying on text comparison :)
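
The colour-keyed variant might look something like this. It assumes each row exposes the bar's inline `style` attribute as a string; the colours, IDs and helper names are invented for illustration, and the real markup may differ.

```ruby
# Pull a six-digit hex colour out of an inline style such as "background-color:#ABCDEF;".
def bar_colour(style)
  match = style.to_s.match(/#(\h{6})\b/)
  match && match[1].downcase
end

# Build a colour => ID Hash from the rows of the lookup table.
def colour_lookup(lookup_rows)
  lookup_rows.each_with_object({}) do |row, map|
    colour = bar_colour(row[:style])
    map[colour] = row[:wikidata] if colour
  end
end

# Invented sample data: placeholder colours and IDs.
lookup = colour_lookup([
  { style: 'background-color:#112233;', wikidata: 'Q000001' },
  { style: 'background-color:#AABBCC;', wikidata: 'Q000002' },
])

member_style = 'background-color: #aabbcc'
p lookup[bar_colour(member_style)]  # "Q000002"
```

Lower-casing the colour means the two tables still match if one writes the hex value in upper case and the other in lower case.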

henare commented 6 years ago

Thanks for the hints @tmtmtmtm :sparkles:

Ah, no, on closer investigation, there's still a mismatch here. The membership table lists the party, but those tables link to the faction. Looks like you'll actually need to use the Fraktionsvorstände table:

That still lists the party group/coalition/faction, not the specific party. In fact I've searched the whole decorated page and I'm pretty sure that specific parties aren't decorated onto the page anywhere :slightly_frowning_face:

(More discussion on https://github.com/everypolitician-scrapers/germany-bundestag-19-members-wikipedia/issues/1)

tmtmtmtm commented 6 years ago

Looks like I sent you on a bit of a wild goose chase there, by giving mixed messages. The field you should be scraping is the parliamentary group/faction, not the party, which is why I was directing you to the lookup tables for that. I'm not quite sure why I thought the first tables were actually the party rather than the group (CDU/CSU being separate vs combined is usually the obvious test for that), so sorry that that confused things :(

I should probably have also suggested that you rename your party fields to faction, as that might have made this clearer, but I'm so used to us using "party" as the generic term to mean either "parliamentary party" or "electoral party" that I hadn't even realised that that might make things more confusing too. (We tried for a while to standardise on "group", but as we're almost always going via Morph, and thus via SQL, calling a column group leads to its own awkwardness.)

henare commented 6 years ago

There are still two columns to fix up before we can call this done: faction and constituency.

tl;dr I think faction is done but I need a hand with constituency :raised_hands:

Faction

Thanks to Tony's help the scraper is now getting the faction for each member. In the prompter this shows that basically every WD entry needs updating from the WD party entity to the corresponding faction.

So there are lots of problems the prompter has identified, but they're real problems that need to be fixed. This means we can call this done for the prompter.

Constituency

Our current constituency comparison shows 383 differences. These are largely where the constituency is "missing" from the CSV. However, as Tony noted above, that's because they're party list members, not directly elected members.
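
One way to check how many of the differences fall into that category is to split them by cause. A minimal sketch, assuming both sides have been reduced to Hashes of member ID => constituency ID (nil or empty where none is given); the method name and all IDs are placeholders.

```ruby
# Classify each member whose constituency differs between the two sources.
def classify_differences(scraped, wikidata)
  differing = scraped.keys.select { |id| scraped[id] != wikidata[id] }
  differing.group_by do |id|
    scraped[id].to_s.empty? ? :missing_in_csv : :conflicting
  end
end

# Placeholder data, for illustration only.
scraped  = { 'Q1' => 'Q100', 'Q2' => nil,    'Q3' => 'Q300' }
wikidata = { 'Q1' => 'Q100', 'Q2' => 'Q200', 'Q3' => 'Q999' }

classify_differences(scraped, wikidata).each do |cause, ids|
  puts "#{cause}: #{ids.join(', ')}"
end
```

With the placeholder data this prints one line per cause, so the `conflicting` group is the part of the diff that still needs a human look.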

Since we're trying to correct data in WD we need to decide what we expect to see for party list members in WD. It looks like WD currently has the specific constituency the list member represents listed. This matches up with what you see on the Bundestag official site. If that's the case then the Wikipedia list can't help us because it doesn't list a constituency for these members:

Note: For the directly elected candidates, the corresponding constituency and the first-vote share (in %) are listed. For the candidates elected via the state lists, only the state of their list is given. If they also ran as a direct candidate in a constituency and were defeated there by a competitor, that constituency and the first-vote share they achieved there are not listed.

I'm also not sure of a way to make the constituency comparison only look for directly elected MPs. From what I can see WD doesn't know if the person was directly elected or elected from the party list.

Where should we go from here? Is there a useful constituency comparison we can still do with this page or is the Wikipedia list not useful in this case for our purposes?