Open marfi opened 8 years ago
Hi @marfi
I'm afraid that sort_name isn't really going to be suitable for that. Because this field is hard to generate automatically without a lot of information about each country or language, we largely rely on being able to extract it from an existing source somewhere.
So for the German example, we get this from the tables on Wikipedia, such as https://de.wikipedia.org/wiki/Liste_der_Mitglieder_des_Deutschen_Bundestages_(18._Wahlperiode)#Abgeordnete
Each row there has a hidden field with a sort name: for example
<span style="display:none;">Aken, Jan van</span>
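To make that concrete, here is a minimal sketch of pulling such hidden sort keys out of a table row with Python's standard library. The class and function names are made up for illustration, and it assumes the hidden spans are not nested inside one another; it is not the scraper we actually run.

```python
from html.parser import HTMLParser


class HiddenSpanParser(HTMLParser):
    """Collect the text of every <span> hidden with display:none."""

    def __init__(self):
        super().__init__()
        self.sort_keys = []
        self._in_hidden = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        style = dict(attrs).get("style") or ""
        if tag == "span" and "display:none" in style.replace(" ", ""):
            self._in_hidden = True
            self.sort_keys.append("")

    def handle_data(self, data):
        if self._in_hidden:
            self.sort_keys[-1] += data

    def handle_endtag(self, tag):
        # Assumption: hidden spans are not nested, so any </span> closes it.
        if tag == "span":
            self._in_hidden = False


def hidden_sort_keys(html):
    """Return the stripped text of each hidden span found in html."""
    parser = HiddenSpanParser()
    parser.feed(html)
    parser.close()
    return [key.strip() for key in parser.sort_keys]


row = '<td><span style="display:none;">Aken, Jan van</span>Jan van Aken</td>'
print(hidden_sort_keys(row))
```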
I believe that in German, the traditional sort order is to treat accented characters as if they were un-accented, rather than as if they were separate letters (one of the reasons why it's very difficult for us to generate this sort of data ourselves). I'm not sure why they sometimes end with !, etc., but for the most part we simply have to follow what's there.
However, if your primary use for this is to split out family names and given names, you might be much better off using the JSON rather than (or in addition to) the CSVs. A large number of the records there have separate fields for these (we only expose a small subset of the fields in the CSV):
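If you need to compare a sort_name against a name despite these differences, one option is to fold both to a common form first. The sketch below is only an illustration: `comparable_key` is a hypothetical helper, and dropping the trailing "!" is an assumption, since we don't know what that marker means in the source tables.

```python
import unicodedata


def comparable_key(text):
    """Fold a name to an accent-free, case-insensitive form for comparison."""
    # Assumption: a trailing "!" (with or without a space before it)
    # carries no meaning for comparison purposes, so it is dropped.
    text = text.strip().rstrip("! ")
    # NFKD splits "ü" into "u" plus a combining diaeresis; drop the marks.
    decomposed = unicodedata.normalize("NFKD", text)
    unaccented = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # casefold() lowercases and also maps "ß" to "ss".
    return unaccented.casefold()


print(comparable_key("Müller, Adolf !") == comparable_key("Muller, Adolf"))
```

This only helps with matching; it cannot recover the accents once a source has dropped them.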
{
"birth_date": "1916-05-13",
"death_date": "2005-02-22",
"family_name": "Müller",
"gender": "male",
"given_name": "Adolf",
"id": "c371060d-ced3-4dc6-bf0e-48acd83f8d1d",
"name": "Adolf Müller",
"other_names": [
…
}
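As a rough sketch of reading those fields, assuming the JSON file keeps its person records in a top-level "persons" list (check this against the file you download) and that `split_names` is just an illustrative helper name:

```python
import json


def split_names(popolo_text):
    """Return id, given_name and family_name for each person record."""
    data = json.loads(popolo_text)
    rows = []
    # Assumption: person records sit under a top-level "persons" key.
    for person in data.get("persons", []):
        rows.append({
            "id": person.get("id"),
            # Not every record has these fields, so .get() may yield None.
            "given_name": person.get("given_name"),
            "family_name": person.get("family_name"),
        })
    return rows


sample = """{"persons": [{
    "family_name": "Müller",
    "given_name": "Adolf",
    "id": "c371060d-ced3-4dc6-bf0e-48acd83f8d1d",
    "name": "Adolf Müller"
}]}"""
for row in split_names(sample):
    print(row["family_name"], row["given_name"])
```

Records without the separate fields come back with None, so you would still need a fallback for those.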
I am trying to figure out a way to extract first and last names from name and/or sort_name. The logic I applied to Belgium (take the first name from name, remove it from sort_name, and treat what is left of sort_name as the family name) does not work for Cyprus, for example, where name and sort_name are identical and in exactly the same order.
In the case of Cyprus, I couldn't programmatically split first from last names, so I didn't bother. If a name's made up of only two words, you can safely assume the first is the surname and the second the given name. If it's for a good cause, I can go through all of them manually. For a bit of trivia, most Cypriot Greek surnames are either patronymics or patrial names that shed the particle under mainland Greek influence. For instance:
Now, uhh ... where were we?
Thanks @wfdd,
With rare exceptions you are right; most of the names consist of only two words:
Dimitriou Misiaouli Stella as well as:
As these are both women, I think they kept their maiden names.
Don't do them manually, I am handling them by splits for now.
Your assumption is correct. The maiden name is the second one. There are fewer exceptions than I recalled :)
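Putting the rules from this exchange together, a split for the Cypriot names could look roughly like the sketch below. The function name is hypothetical, and the three-word rule (surname, maiden name, given name) is only known to hold for the exceptions discussed above, so anything else is left for manual review.

```python
def split_cypriot_name(name):
    """Split a surname-first name into its parts, or return None if unsure."""
    parts = name.split()
    if len(parts) == 2:
        # Two words: surname first, then given name.
        family, given = parts
        return {"family_name": family, "maiden_name": None, "given_name": given}
    if len(parts) == 3:
        # Assumption: three words are surname, maiden name, given name,
        # as in the exceptions mentioned in this thread.
        family, maiden, given = parts
        return {"family_name": family, "maiden_name": maiden, "given_name": given}
    # Anything else needs a human to look at it.
    return None


print(split_cypriot_name("Dimitriou Misiaouli Stella"))
```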
I have been trying for quite some time to wrap my head around sort_name.
In Belgium the values appear to be family name first, then first name, with this one exception: https://github.com/everypolitician/everypolitician-data/blob/master/data/Germany/
(there are many more exceptions of this kind in the UK data: https://github.com/everypolitician/everypolitician-data/blob/master/data/UK/Commons/term-55.csv#L12 and then L14 and more)
In Germany things look different:
1) I don't understand the need for "!" at the end of the string, or why on some lines it is preceded by a space: https://github.com/everypolitician/everypolitician-data/blob/master/data/Germany/Bundestag/term-18.csv#L359
2) Special characters are not accounted for (ü > u): https://github.com/everypolitician/everypolitician-data/blob/master/data/Germany/Bundestag/term-18.csv#L33
3) Names are skipped: https://github.com/everypolitician/everypolitician-data/blob/master/data/Germany/Bundestag/term-18.csv#L36
4) Dashes are not preserved: https://github.com/everypolitician/everypolitician-data/blob/master/data/Germany/Bundestag/term-18.csv#L68
5) This one I don't understand at all: https://github.com/everypolitician/everypolitician-data/blob/master/data/Germany/Bundestag/term-18.csv#L106
I am trying to figure out a way to extract first and last names from name and/or sort_name. The logic I applied to Belgium (take the first name from name, remove it from sort_name, and treat what is left of sort_name as the family name) does not work for Cyprus, for example, where name and sort_name are identical and in exactly the same order.
Any help here will be greatly appreciated.
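For reference, the Belgium logic described above looks roughly like this. It is a simplified sketch with a made-up function name and a made-up example name, and it assumes the first word of name is the given name. It returns None when name and sort_name are identical, which is exactly the Cyprus case.

```python
def family_name_from_sort_name(name, sort_name):
    """Return (given_name, family_name), or None if the rule cannot apply."""
    name_words = name.split()
    if not name_words or name.strip() == sort_name.strip():
        # Identical values (as in the Cyprus data) give nothing to remove.
        return None
    # Assumption: the first word of name is the given name.
    given = name_words[0]
    sort_words = sort_name.replace(",", " ").split()
    if given not in sort_words:
        return None
    # What is left of sort_name after removing the given name
    # is treated as the family name.
    sort_words.remove(given)
    return given, " ".join(sort_words)


# "Anna Peeters" is a made-up example, not a record from the data.
print(family_name_from_sort_name("Anna Peeters", "Peeters Anna"))
print(family_name_from_sort_name("Dimitriou Misiaouli Stella",
                                 "Dimitriou Misiaouli Stella"))
```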