Closed TomazErjavec closed 4 months ago
Added script and resulting files, @matyaskopp, maybe just have a quick look if you find any problems.
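listPerson
- I like the <FROM>/<TO> separator, but I think the missing <FROM> or <TO> value should be "-" instead of an empty string (2019-06-20/ vs 2019-06-20/-). When both are missing then probably "-/-" should be used.
- I don't like the ;<SPACE> separator, I think "|" without space is easier to process (and more inspired by conllu format)
- long dates can be trimmed (2017-10-21T14:00:00)
- IT contains duplicate values in URIs, so the question is whether some sorting and deduplication shouldn't be done... https://github.com/clarin-eric/ParlaMint/blob/7b52ae5a0cc57c2464b0372d6275f3135de59dac/Samples/ParlaMint-IT/ParlaMint-IT-listPerson.xml#L4322-L4323
- missing organization IDs column - impossible/difficult to link with ParlaMint-listOrg.tsv
listOrg
- if CHES ID is multivalue, then use the same separator as in listPerson ("|" ??)
I like the <FROM>/<TO> separator, but I think the missing <FROM> or <TO> value should be "-" instead of an empty string (2019-06-20/ vs 2019-06-20/-). When both are missing then probably "-/-" should be used.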
OK, that's the way I did it now & we have a new function for this https://github.com/clarin-eric/ParlaMint/blob/49d21f9793c5c0124a9d48dbaf2faf5ac1367374/Scripts/parlamint-lib.xsl#L908
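For readers who want to reproduce the convention outside the XSLT, here is a minimal Python sketch of the <FROM>/<TO> rule discussed above. The function name format_range is hypothetical and this is not the code from parlamint-lib.xsl, only an illustration of the "-" placeholder for a missing bound.

```python
from typing import Optional


def format_range(start: Optional[str] = None, end: Optional[str] = None) -> str:
    """Render a FROM/TO range, writing '-' for a missing (None or empty) bound."""
    return f"{start or '-'}/{end or '-'}"


if __name__ == "__main__":
    print(format_range("2019-06-20"))                # open-ended range
    print(format_range("2003-06-05", "2004-11-13"))  # closed range
    print(format_range())                            # both bounds missing
```

With a placeholder in both positions, every range splits on "/" into exactly two non-empty fields.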
I don't like the ;<SPACE> separator, I think "|" without space is easier to process (and more inspired by conllu format)
I personally don't like pipe, as a) it means "or" and not "and" (as it should here) and b) if you use it in a regex context, you have problems, because you have to escape it. That said, we do use it in vertical files anyway, and for consistency, might as well use it here too. So, now pipe is used in these TSVs. Note that it is used consistently, i.e. even in contexts where one might not expect it, like when a person has two surnames: Pérez|Abellás
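To illustrate the escaping point, a small Python sketch of reading a pipe-separated cell. The helper names are hypothetical; the only assumption about the data is that "|" never occurs inside a single value.

```python
import re
from typing import List

# In a regex, a bare "|" means alternation, so it has to be escaped.
PIPE = re.compile(re.escape("|"))


def split_values(cell: str) -> List[str]:
    """Split a multi-valued TSV cell with plain string splitting (no escaping needed)."""
    return cell.split("|")


def split_values_regex(cell: str) -> List[str]:
    """Same result through a regex, using the escaped pipe."""
    return PIPE.split(cell)


if __name__ == "__main__":
    print(split_values("Pérez|Abellás"))
    print(split_values_regex("Pérez|Abellás"))
```

Plain string splitting sidesteps the escaping issue entirely, so the regex variant is only needed when the pipe is part of a larger pattern.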
long dates can be trimmed (2017-10-21T14:00:00)
This was done already, but only for some dates - now done everywhere.
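As a sketch of what trimming means here (assuming the values are ISO 8601 dates or date-times, as in the example; trim_date is a hypothetical name, not a function from the ParlaMint scripts):

```python
def trim_date(value: str) -> str:
    """Keep only the date part of an ISO 8601 date-time; plain dates pass through."""
    return value.split("T", 1)[0]


if __name__ == "__main__":
    print(trim_date("2017-10-21T14:00:00"))
    print(trim_date("2021-10-09"))
```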
IT contains duplicate values in URIs, so the question is whether some sorting and deduplication shouldn't be done... https://github.com/clarin-eric/ParlaMint/blob/7b52ae5a0cc57c2464b0372d6275f3135de59dac/Samples/ParlaMint-IT/ParlaMint-IT-listPerson.xml#L4322-L4323
I don't think the job of this script is fixing errors in source, so, no, won't implement this.
missing organization IDs column - impossible/difficult to link with ParlaMint-listOrg.tsv
Even if you have organisation IDs, you are still missing a lot of info about the affiliation. But, better some info than none, so added two columns, one for membership with politicalParty and the other with parliamentaryGroup, as the two most interesting ones. They also have the date range, so you get e.g.
VB[2004-11-14/2007-05-02]|Vlaams Blok[2003-06-05/2004-11-13]
Note that the affiliations are not sorted by date (it would be very difficult) and that I use the abbreviated name of the org if it exists. This is more readable than the ID; on the other hand, it might be more difficult to match with listOrg, so I'm not quite sure that this was a good decision. @matyaskopp, what do you think?
listOrg
- if CHES ID is multivalue, then use the same separator as in listPerson ("|" ??)
This probably belongs in #859, but, yes, I changed it to pipe now as well.
After discussion with @matyaskopp, we came to the decision that the following should be changed:
- instead of abbrev[date-range] we should have abbrev#id[date-range], so that people can both read the name of the party as well as get its ID for linking with the orgList.tsv file
- besides @role=member we should also take @role=representative, which would better cover (at least) CZ
Now implemented the above. The only difference is that on looking at the various roles, I added a few more to "member" and "representative", namely https://github.com/clarin-eric/ParlaMint/blob/f336fc6d38a609316396c3cda2edf391481882e6/Build/Scripts/listPerson-tei2tsv.xsl#L117-L118
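For anyone consuming the TSV, a cell in the abbrev#id[date-range] form can be read back roughly as follows. This is only a sketch under assumptions: the names are hypothetical, an empty cell is taken to be "-", and names and IDs are assumed not to contain "|", "[" or "]".

```python
import re
from typing import List, NamedTuple, Optional


class Affiliation(NamedTuple):
    name: str             # abbreviated (or full) organisation name
    org_id: str           # ID for linking with the organisation TSV
    start: Optional[str]  # None when the cell has "-"
    end: Optional[str]


# One item looks like: ODS#parliamentaryGroup.ODS[2021-10-12/-]
_ITEM = re.compile(
    r"^(?P<name>.+)#(?P<org_id>[^#\[\]]+)"
    r"\[(?P<start>[^/\[\]]+)/(?P<end>[^/\[\]]+)\]$"
)


def _bound(value: str) -> Optional[str]:
    """Map the '-' placeholder back to None."""
    return None if value == "-" else value


def parse_affiliations(cell: str) -> List[Affiliation]:
    """Parse a pipe-separated affiliation cell into structured items."""
    if cell == "-":  # assumption: "-" marks an empty cell
        return []
    items = []
    for raw in cell.split("|"):
        match = _ITEM.match(raw)
        if match is None:
            raise ValueError(f"unexpected affiliation item: {raw!r}")
        items.append(Affiliation(match["name"], match["org_id"],
                                 _bound(match["start"]), _bound(match["end"])))
    return items


if __name__ == "__main__":
    for item in parse_affiliations("SPOLU#politicalParty.SPOLU[2021-10-09/-]"):
        print(item)
```

The org_id field is what you would join on when linking to the organisation file; the name is there for readability only.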
@matyaskopp, if ParlaMint-listPerson.tsv and ParlaMint-listPerson-en.tsv seem ok, you can close this.
Note that the current ParlaMint-listPerson-en.tsv is generated from Source-TEI data, i.e. it has more affiliations than the one from Disto/ would, as in this directory the adjacent affiliations have not yet been merged.
I don't think that head and deputyHead should be there, because they imply the member role. So currently there are duplicates (or near-duplicates: first membership, and the next day voting and reaching deputy head status).
<person xml:id="EvaDecroix.1982">
<persName>
<surname>Decroix</surname>
<forename>Eva</forename>
</persName>
<!-- SKIPPING -->
<affiliation ref="#parliamentaryGroup.ODS" role="deputyHead" from="2021-12-13T00:00:00" ana="#parliamentaryGroup.ODS.1538">
<roleName xml:lang="cs">1. místopředseda</roleName>
<roleName xml:lang="en">Deputy Head</roleName>
</affiliation>
<!-- SKIPPING -->
<affiliation ref="#parliamentaryGroup.ODS" role="member" from="2021-10-12T00:00:00" ana="#parliamentaryGroup.ODS.1538"/>
<affiliation ref="#parliament.PSP" role="member" from="2021-10-09T14:00:00" ana="#parliament.PSP9"/>
<affiliation ref="#politicalParty.SPOLU" role="representative" from="2021-10-09">
<roleName xml:lang="cs">Reprezentant</roleName>
<roleName xml:lang="en">Representative</roleName>
</affiliation>
</person>
result:
CZ EvaDecroix.1982 Decroix Eva Eva Decroix - - 2021-10-09/- ODS#parliamentaryGroup.ODS[2021-12-13/-]|ODS#parliamentaryGroup.ODS[2021-10-12/-] SPOLU#politicalParty.SPOLU[2021-10-09/-] F 1982-05-26 - - - - - - https://www.psp.cz/sqw/detail.sqw?id=6727
I don't think that head and deputyHead should be there
OK, sorry, removed.
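To make the effect of dropping head and deputyHead concrete, here is a rough Python sketch that builds the affiliation cells from a cut-down copy of the person above. It is not the XSLT from the repository: the ORGS lookup is a hypothetical stand-in for listOrg, and KEPT_ROLES holds only the two roles named in this thread (the real script keeps a few more).

```python
import xml.etree.ElementTree as ET

# Assumption: only the two roles named in this thread; the real script keeps a few more.
KEPT_ROLES = {"member", "representative"}

# Hypothetical stand-in for listOrg: organisation ID -> (organisation role, abbreviation)
ORGS = {
    "parliamentaryGroup.ODS": ("parliamentaryGroup", "ODS"),
    "politicalParty.SPOLU": ("politicalParty", "SPOLU"),
}

PERSON = """<person xml:id="EvaDecroix.1982">
  <affiliation ref="#parliamentaryGroup.ODS" role="deputyHead" from="2021-12-13T00:00:00"/>
  <affiliation ref="#parliamentaryGroup.ODS" role="member" from="2021-10-12T00:00:00"/>
  <affiliation ref="#parliament.PSP" role="member" from="2021-10-09T14:00:00"/>
  <affiliation ref="#politicalParty.SPOLU" role="representative" from="2021-10-09"/>
</person>"""


def affiliation_cell(person: ET.Element, org_role: str) -> str:
    """Build one TSV cell with the person's affiliations to organisations of org_role."""
    items = []
    for aff in person.iter("affiliation"):
        org_id = aff.get("ref", "").lstrip("#")
        kind, abbrev = ORGS.get(org_id, (None, None))
        if aff.get("role") not in KEPT_ROLES or kind != org_role:
            continue  # skips head/deputyHead and other organisation types
        start = aff.get("from", "-").split("T", 1)[0]  # trim long dates
        end = aff.get("to", "-").split("T", 1)[0]
        items.append(f"{abbrev}#{org_id}[{start}/{end}]")
    return "|".join(items) or "-"


if __name__ == "__main__":
    person = ET.fromstring(PERSON)
    print(affiliation_cell(person, "parliamentaryGroup"))
    print(affiliation_cell(person, "politicalParty"))
```

With deputyHead filtered out, the parliamentaryGroup cell for this person holds a single ODS entry instead of the two near-duplicates shown in the result row above.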
I think this has all been implemented, so closing.
Now that we have metadata for organisations (#859), it is only proper that we have it for persons as well. The idea is to, as for orgs, produce two metadata TSV files, one in the original language, the other in English, which give the basic (mostly time independent) information on persons.