clarin-eric / ParlaMint

ParlaMint: Comparable Parliamentary Corpora
https://clarin-eric.github.io/ParlaMint/
GR: listPerson and listOrg elements wrongly marked for language #735

TomazErjavec closed this issue 1 year ago

TomazErjavec commented 1 year ago

In connection with #732 the program complained about the strings below:

ERROR: "HOLLANDE FRANÇOIS" in persName marked as el but script is (also) Latin, fixing language to el-Latn
ERROR: "K.K.E." in orgName abb marked as el but script is (also) Latin, fixing language to el-Latn

I guess the persName of the first one should be marked with xml:lang="fr" (or, arguably even better, xml:lang="en"), and the second one with xml:lang="en" (or possibly xml:lang="el-Latn").
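
A minimal sketch of the kind of check that could produce messages like the ones above. It assumes the script of a string can be inferred from the Unicode names of its letters. This is not the actual ParlaMint program, and `scripts_in` and `check_lang` are hypothetical names used only for illustration.

```python
import unicodedata


def scripts_in(text):
    """Return the script prefixes (e.g. "LATIN", "GREEK") of the letters in text."""
    found = set()
    for ch in text:
        if ch.isalpha():
            # Unicode names start with the script, e.g. "GREEK CAPITAL LETTER KAPPA".
            name = unicodedata.name(ch, "")
            if name:
                found.add(name.split(" ")[0])
    return found


def check_lang(text, lang):
    """Hypothetical rule: Greek-marked text containing Latin letters becomes el-Latn."""
    if lang == "el" and "LATIN" in scripts_in(text):
        return "el-Latn"
    return lang


print(check_lang("HOLLANDE FRANÇOIS", "el"))
print(check_lang("K.K.E.", "el"))
```

Both strings from the error messages contain Latin letters, so under this rule neither would keep the plain "el" code.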

You could send the corrected ParlaMint-GR-listPerson.xml and ParlaMint-GR-listOrg.xml or just write here how to correct this.

Note also (cf. #732) that the new transliteration program will add the transliteration of all relevant elements to el-Latn using ISO 843. If you would prefer to do it yourself, maybe using some other transliteration scheme, you are of course welcome to do it, although it would need to be done soon.

DimitrisGk-iel commented 1 year ago

The organization name is a typo: instead of the Greek letters I used the English (Latin) ones. It will be fixed in our next PR.

For the person name will this be enough?

HOLLANDE FRANÇOIS

TomazErjavec commented 1 year ago

> The organization name is a typo: instead of the Greek letters I used the English (Latin) ones. It will be fixed in our next PR.

OK, great, thanks.

> For the person name will this be enough?

Given that English would also write it this way, it would be simpler if you used "en" for the language code. If you want to be a purist, you can use "fr", but then pls. also add French to your langUsage here: https://github.com/clarin-eric/ParlaMint/blob/0dfbeb729f258a114e29b82faca9e847ed4c51b6/Samples/ParlaMint-GR/ParlaMint-GR.xml#L129-L134 (and the same in ParlaMint-GR.ana.xml).

Also, pls. fix the English names of the languages in langUsage so that they start with a capital letter, i.e.

<language ident="el" xml:lang="en">Greek</language>
<language ident="en" xml:lang="en">English</language>

(I forgot to ask you this, as I already fixed it locally, but if you will be changing the root files, it is best you fix this as well.)

TomazErjavec commented 1 year ago

This has been fixed; however, "HOLLANDE FRANÇOIS" was annotated like this:

      <persName>
         <surname xml:lang="fr">HOLLANDE</surname>
         <forename xml:lang="fr">FRANÇOIS</forename>
      </persName>

I know I wrote that this is OK, but I had not thought it through. It turns out that, because of the way the transliteration script works, this is not a good solution; it is also not the most economical encoding. So I have locally changed this (in both .TEI and .TEI.ana) to:

      <persName xml:lang="fr">
         <surname>Hollande</surname>
         <forename>François</forename>
      </persName>

Note that I have also changed the capitalisation; the reason is that all other corpora have the standard capital-casing of names, except for Greek, and it would be nice if it were compatible at least for the non-el names (e.g. if Hollande was also speaking in some other parliament).
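
For anyone making the same change in bulk, here is a rough sketch of the transformation using Python's standard `xml.etree.ElementTree`. It assumes a namespace-free fragment like the one above (the real files use the TEI namespace, so element lookups would need adjusting), and `hoist_lang` is a hypothetical helper, not part of the ParlaMint scripts.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

BEFORE = """<persName>
   <surname xml:lang="fr">HOLLANDE</surname>
   <forename xml:lang="fr">FRANÇOIS</forename>
</persName>"""


def hoist_lang(pers_name):
    """Move an xml:lang shared by all children up to persName and capital-case the parts."""
    children = list(pers_name)
    langs = {child.get(XML_LANG) for child in children}
    # Only act when every child carries the same explicit language.
    if children and len(langs) == 1 and None not in langs:
        pers_name.set(XML_LANG, langs.pop())
        for child in children:
            del child.attrib[XML_LANG]
            # Naive casing: str.capitalize() lowercases everything after the first
            # letter, so hyphenated or particle names would need special handling.
            child.text = (child.text or "").capitalize()
    return pers_name


after = hoist_lang(ET.fromstring(BEFORE))
print(ET.tostring(after, encoding="unicode"))
```

Names whose children have no xml:lang, or differing ones, are left untouched, so Greek-script names keep their existing casing.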

I will close this issue, but @DimitrisGk-iel, could you please make the same change on your side? Sorry for the added work!