clarin-eric / ParlaMint

ParlaMint: Comparable Parliamentary Corpora
https://clarin-eric.github.io/ParlaMint/

Transliteration of non-Latin metadata #732

Closed: TomazErjavec closed this issue 1 year ago

TomazErjavec commented 1 year ago

BG, GR and UA use non-Latin scripts, but not all of their metadata in listPerson and listOrg is translated into English or transliterated into Latin script. As a result, ParlaMint-en has all the transcriptions in English, but e.g. the names of the speakers remain in Greek or Cyrillic script. This makes the corpora less useful for people who do not speak the languages, and especially for those who are unfamiliar with the script or have no keyboard for typing the names. This issue attempts to resolve the problem.

TomazErjavec commented 1 year ago

The script trans-execute.pl now transliterates the relevant elements in listPerson and listOrg. It first extracts the text content of these elements, transliterates it using Perl's Lingua::Translit, and then inserts the transliterations into the file. Currently it transliterates only elements that do not already have an English or transliterated equivalent. Arguably, all elements should be transliterated, regardless of whether they have an English "translation". The script also reports when elements marked as being in the corpus language contain ASCII letters.
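
For readers who do not use Perl, the extract / transliterate / insert steps described above can be sketched in Python. This is not the actual trans-execute.pl: the names `transliterate`, `add_transliterations` and `GREEK_TO_LATIN` are hypothetical, the character table is a toy stand-in for Lingua::Translit (it preserves case, so its output will differ from what the real script produces), the sample XML omits the TEI namespace, and "equivalent" is assumed to mean a sibling with the same tag and `full` attribute tagged as English or as the transliterated language.

```python
import copy
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Toy Greek-to-Latin table, for illustration only (NOT the Lingua::Translit rules).
GREEK_TO_LATIN = {
    "Α": "A", "Δ": "D", "Ι": "I", "Κ": "K",
    "Λ": "L", "Ν": "N", "Ο": "O", "Σ": "S",
}


def transliterate(text):
    """Map each character through the toy table; unknown characters pass through."""
    return "".join(GREEK_TO_LATIN.get(ch, ch) for ch in text)


def add_transliterations(parent, tag, latn_lang):
    """Insert a transliterated copy after each `tag` child that has no xml:lang.

    A child is skipped when a sibling with the same tag and `full` attribute
    is already marked as English or as `latn_lang` (assumed meaning of
    "equivalent"). Returns the number of elements added.
    """
    added = 0
    for el in list(parent):  # iterate over a snapshot, since we insert below
        if el.tag != tag or el.get(XML_LANG) is not None:
            continue
        has_equivalent = any(
            other.tag == tag
            and other.get(XML_LANG) in ("en", latn_lang)
            and other.get("full") == el.get("full")
            for other in parent
        )
        if has_equivalent:
            continue
        clone = copy.deepcopy(el)
        clone.set(XML_LANG, latn_lang)
        # Transliterate the text of the element and of all its descendants.
        for node in clone.iter():
            if node.text and node.text.strip():
                node.text = transliterate(node.text)
        parent.insert(list(parent).index(el) + 1, clone)
        added += 1
    return added


org = ET.fromstring('<org><orgName full="abb">Ν.Δ.</orgName></org>')
add_transliterations(org, "orgName", "el-Latn")
print(ET.tostring(org, encoding="unicode"))
```

Running the function a second time on the same tree adds nothing, because the first pass leaves a transliterated sibling behind; that mirrors the "only elements without an equivalent" behaviour described above.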

The script needs to be incorporated either into the pre-processing of the corpora (as factorisation is) or into the parlamint2distro.pl script.

TomazErjavec commented 1 year ago

The script needs to be incorporated either into the pre-processing of the corpora (as factorisation is)

This has now been done and appears to work: at least GR reported no errors and now has the transliterated elements in listOrg and listPerson, e.g.

      <orgName full="abb">Ν.Δ.</orgName>
      <orgName full="abb" xml:lang="el-Latn">N.D.</orgName>

and

      <persName>
         <surname>ΝΙΚΟΛΟΠΟΥΛΟΣ</surname>
         <forename>ΝΙΚΟΛΑΟΣ</forename>
      </persName>
      <persName xml:lang="el-Latn">
         <surname>Nikolopoylos</surname>
         <forename>Nikolaos</forename>
      </persName>
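
Output like the above can be spot-checked by confirming that every element tagged with the transliterated language contains only Latin letters. The following Python sketch shows one way to do that; `non_latin_letters` and `check_transliterated` are hypothetical helper names, the check relies on Unicode character names starting with "LATIN", and the sample again omits the TEI namespace.

```python
import unicodedata
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def non_latin_letters(text):
    """Return the alphabetic characters whose Unicode name is not LATIN ..."""
    return [
        ch for ch in text
        if ch.isalpha() and not unicodedata.name(ch, "").startswith("LATIN")
    ]


def check_transliterated(root, latn_lang):
    """List (tag, leftover letters) for text inside elements tagged `latn_lang`."""
    problems = []
    for el in root.iter():
        if el.get(XML_LANG) != latn_lang:
            continue
        # Check the tagged element itself and all of its descendants.
        for node in el.iter():
            leftovers = non_latin_letters(node.text or "")
            if leftovers:
                problems.append((node.tag, "".join(leftovers)))
    return problems


sample = ET.fromstring(
    '<person><persName xml:lang="el-Latn">'
    "<surname>Nikolopoylos</surname><forename>Nikolaos</forename>"
    "</persName></person>"
)
print(check_transliterated(sample, "el-Latn"))
```

An empty list means no Greek or Cyrillic letters were left behind in the transliterated elements.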

So, closing.