clarin-eric / ParlaMint

ParlaMint: Comparable Parliamentary Corpora
https://clarin-eric.github.io/ParlaMint/

Adding metadata for persons #860

Closed TomazErjavec closed 4 months ago

TomazErjavec commented 6 months ago

Now that we have metadata for organisations (#859), it is only proper that we have it for persons as well. The idea is, as for orgs, to produce two metadata TSV files, one in the original language and the other in English, which give the basic (mostly time-independent) information on persons.

TomazErjavec commented 6 months ago

Added the script and the resulting files. @matyaskopp, maybe just have a quick look to see if you find any problems.

matyaskopp commented 5 months ago

listPerson

listOrg

TomazErjavec commented 5 months ago

I like the <FROM>/<TO> separator, but I think the missing <FROM> or <TO> value should be - instead of an empty string (2019-06-20/ vs 2019-06-20/-). When both are missing then probably -/- should be used.

OK, that's the way I did it now & we have a new function for this https://github.com/clarin-eric/ParlaMint/blob/49d21f9793c5c0124a9d48dbaf2faf5ac1367374/Scripts/parlamint-lib.xsl#L908
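For readers who process these TSVs outside XSLT, the convention can be sketched in a few lines of Python. This is only an illustration of the agreed output format, not a port of the linked function in parlamint-lib.xsl; the name `format_range` is hypothetical.

```python
from typing import Optional


def format_range(start: Optional[str], end: Optional[str]) -> str:
    """Join two optional dates as FROM/TO, writing '-' for a missing side.

    Hypothetical helper: it mirrors the format agreed on in this thread,
    not the actual XSLT implementation.
    """
    return f"{start or '-'}/{end or '-'}"


print(format_range("2019-06-20", None))  # 2019-06-20/-
print(format_range(None, None))          # -/-
```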

I don't like the ;<SPACE> separator; I think | without a space is easier to process (and closer to the CoNLL-U format)

I personally don't like the pipe, as a) it means "or" and not "and" (as it should here) and b) if you use it in a regex context, you have to escape it. That said, we use it in the vertical files anyway, so for consistency we might as well use it here too. So the pipe is now used in these TSVs. Note that it is used consistently, i.e. even in contexts where one might not expect it, such as when a person has two surnames: Pérez|Abellás
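A small Python sketch of what the escaping concern looks like in practice when reading a multi-valued cell. The helper name `split_values` is hypothetical, and treating `-` as "no value" is an assumption based on the sample row later in this thread.

```python
import re
from typing import List

cell = "Pérez|Abellás"

# str.split treats the pipe literally, so no escaping is needed.
print(cell.split("|"))        # ['Pérez', 'Abellás']

# In a regular expression the pipe means alternation and must be escaped.
print(re.split(r"\|", cell))  # ['Pérez', 'Abellás']


def split_values(cell: str) -> List[str]:
    """Split a pipe-separated cell; '-' or an empty cell yields no values.

    Hypothetical helper; the '-' convention is assumed from the sample row.
    """
    return [] if cell in ("", "-") else cell.split("|")
```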

long dates can be trimmed 2017-10-21T14:00:00

This was done already, but only for some dates - now done everywhere.
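As a rough illustration of the trimming, assuming the timestamps are always ISO 8601 with a `T` between date and time (the function name `trim_date` is hypothetical, and this is not the XSLT code itself):

```python
def trim_date(value: str) -> str:
    """Drop the time part of an ISO 8601 timestamp, keeping only the date."""
    return value.split("T", 1)[0]


print(trim_date("2017-10-21T14:00:00"))  # 2017-10-21
print(trim_date("2021-10-09"))           # 2021-10-09
```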

IT contains duplicate values in URIs, so the question is whether some sorting and deduplication shouldn't be done... https://github.com/clarin-eric/ParlaMint/blob/7b52ae5a0cc57c2464b0372d6275f3135de59dac/Samples/ParlaMint-IT/ParlaMint-IT-listPerson.xml#L4322-L4323

I don't think the job of this script is fixing errors in source, so, no, won't implement this.

missing organization IDs column - impossible/difficult to link with ParlaMint-listOrg.tsv

Even if you have organisation IDs, you are still missing a lot of info about the affiliation. But better some info than none, so I added two columns, one for membership in a politicalParty and the other for membership in a parliamentaryGroup, as the two most interesting ones. They also include the date range, so you get e.g. VB[2004-11-14/2007-05-02]|Vlaams Blok[2003-06-05/2004-11-13]

Note that the affiliations are not sorted by date (it would be very difficult) and that I use the abbreviated name of the org if it exists. This is more readable than the ID; on the other hand, it might be more difficult to match with listOrg, so I'm not quite sure that this was a good decision. @matyaskopp, what do you think?

listOrg

  • if CHES ID is multivalue, then use the same separator as in listPerson (| ??)

This probably belongs in #859, but, yes, I changed it to pipe now as well.

TomazErjavec commented 5 months ago

After discussion with @matyaskopp, we came to the decision that the following should be changed:

TomazErjavec commented 5 months ago

Now implemented the above. The only difference is that on looking at the various roles, I added a few more to "member" and "representative", namely https://github.com/clarin-eric/ParlaMint/blob/f336fc6d38a609316396c3cda2edf391481882e6/Build/Scripts/listPerson-tei2tsv.xsl#L117-L118

@matyaskopp, if ParlaMint-listPerson.tsv and ParlaMint-listPerson-en.tsv seem ok, you can close this.

Note that the current ParlaMint-listPerson-en.tsv is generated from Source-TEI data, i.e. it has more affiliations than the one from Disto/ would, as in this directory the adjacent affiliations have not yet been merged.

matyaskopp commented 5 months ago

I don't think that head and deputyHead should be there, because they imply the member role. So currently there are duplicates (or near-duplicates: first the membership, then the next day a vote and the deputy head status).

   <person xml:id="EvaDecroix.1982">
      <persName>
         <surname>Decroix</surname>
         <forename>Eva</forename>
      </persName>
<!-- SKIPPING  -->
      <affiliation ref="#parliamentaryGroup.ODS" role="deputyHead" from="2021-12-13T00:00:00" ana="#parliamentaryGroup.ODS.1538">
         <roleName xml:lang="cs">1. místopředseda</roleName>
         <roleName xml:lang="en">Deputy Head</roleName>
      </affiliation>
<!-- SKIPPING  -->
      <affiliation ref="#parliamentaryGroup.ODS" role="member" from="2021-10-12T00:00:00" ana="#parliamentaryGroup.ODS.1538"/>
      <affiliation ref="#parliament.PSP" role="member" from="2021-10-09T14:00:00" ana="#parliament.PSP9"/>
      <affiliation ref="#politicalParty.SPOLU" role="representative" from="2021-10-09">
         <roleName xml:lang="cs">Reprezentant</roleName>
         <roleName xml:lang="en">Representative</roleName>
      </affiliation>
   </person>

result:

CZ  EvaDecroix.1982 Decroix Eva Eva Decroix -   -   2021-10-09/-    ODS#parliamentaryGroup.ODS[2021-12-13/-]|ODS#parliamentaryGroup.ODS[2021-10-12/-]   SPOLU#politicalParty.SPOLU[2021-10-09/-]    F   1982-05-26  -   -   -   -   -   -   https://www.psp.cz/sqw/detail.sqw?id=6727
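
To make the duplication easier to inspect, an affiliation cell like the one above can be unpacked with a short Python sketch. It assumes each entry has the shape `NAME#ID[FROM/TO]` with the `#ID` part optional, that entries are separated by `|`, and that `-` marks a missing value; the names `Affiliation`, `ENTRY` and `parse_affiliations` are hypothetical.

```python
import re
from typing import List, NamedTuple, Optional


class Affiliation(NamedTuple):
    name: str             # abbreviated or full organisation name
    ref: Optional[str]    # organisation ID after '#', if present
    start: Optional[str]  # FROM date, None when given as '-'
    end: Optional[str]    # TO date, None when given as '-'


# Assumed entry shape: NAME, optional #ID, then [FROM/TO].
ENTRY = re.compile(
    r"^(?P<name>[^#\[]+)(?:#(?P<ref>[^\[]+))?"
    r"\[(?P<start>[^/\]]+)/(?P<end>[^\]]+)\]$"
)


def parse_affiliations(cell: str) -> List[Affiliation]:
    """Parse one pipe-separated affiliation cell into structured entries."""
    if cell in ("", "-"):
        return []
    result = []
    for entry in cell.split("|"):
        m = ENTRY.match(entry)
        if m is None:
            raise ValueError(f"unrecognised affiliation entry: {entry!r}")
        start, end = (None if v == "-" else v for v in (m["start"], m["end"]))
        result.append(Affiliation(m["name"], m["ref"], start, end))
    return result


cell = ("ODS#parliamentaryGroup.ODS[2021-12-13/-]"
        "|ODS#parliamentaryGroup.ODS[2021-10-12/-]")
for aff in parse_affiliations(cell):
    print(aff.ref, aff.start, aff.end)
```

Both parsed entries point at the same organisation ID and differ only in the start date, which is the near-duplicate described above.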

TomazErjavec commented 5 months ago

I don't think that head and deputyHead should be there

OK, sorry, removed.

TomazErjavec commented 4 months ago

I think this has all been implemented, so closing.