Closed TomazErjavec closed 1 year ago
The script trans-execute.pl now transliterates the relevant elements in listPerson and listOrg. It first extracts the text content of these elements, transliterates it using Perl's Lingua::Translit, and then inserts the transliterations into the file.
Currently it transliterates only elements that do not have an English or already transliterated equivalent. Arguably, all elements should be transliterated, regardless of whether they have an English "translation".
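To make the skip rule concrete, here is a minimal Python sketch of the logic described above. It is not the Perl code of trans-execute.pl. The tiny character table, the function names, and the choice to match "equivalents" by the `full` attribute are all assumptions made for illustration, and the TEI namespace is ignored for brevity.

```python
import copy
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical, deliberately tiny Greek-to-Latin table (illustration only).
GREEK_TO_LATIN = {
    "\u039d": "N",  # GREEK CAPITAL LETTER NU
    "\u0394": "D",  # GREEK CAPITAL LETTER DELTA
    "\u039a": "K",  # GREEK CAPITAL LETTER KAPPA
    "\u039b": "L",  # GREEK CAPITAL LETTER LAMDA
    "\u03a3": "S",  # GREEK CAPITAL LETTER SIGMA
}


def transliterate(text):
    """Map each character through the table; unknown characters pass through."""
    return "".join(GREEK_TO_LATIN.get(ch, ch) for ch in text)


def is_latin(el):
    """True if the element is marked as English or as a *-Latn transliteration."""
    lang = el.get(XML_LANG, "")
    return lang == "en" or lang.endswith("-Latn")


def add_transliterations(parent, tag, target_lang):
    """Insert a transliterated copy after each `tag` child of `parent` that has
    no English or Latin-script sibling with the same `full` attribute
    (assumed matching rule). Returns the number of copies inserted."""
    added = 0
    for el in list(parent.findall(tag)):
        if is_latin(el):
            continue
        has_equivalent = any(
            is_latin(other) and other.get("full") == el.get("full")
            for other in parent.findall(tag)
        )
        if has_equivalent:
            continue
        clone = copy.deepcopy(el)
        clone.set(XML_LANG, target_lang)
        for node in clone.iter():
            if node.text:
                node.text = transliterate(node.text)
            if node is not clone and node.tail:
                node.tail = transliterate(node.tail)
        parent.insert(list(parent).index(el) + 1, clone)
        added += 1
    return added


org = ET.fromstring('<org><orgName full="abb">\u039d.\u0394.</orgName></org>')
add_transliterations(org, "orgName", "el-Latn")
print(ET.tostring(org, encoding="unicode"))
```

Running the function a second time on the same tree adds nothing, because the inserted copies now count as existing equivalents. The real script relies on Lingua::Translit rather than a hand-written table.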
The script also reports when elements marked for the corpus language contain ASCII letters.
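A check of this kind could look roughly like the following Python sketch. It is an assumption about how such a report might work, not the script's actual code: the function name is hypothetical, and an element is treated as "marked for the corpus language" when its own or nearest inherited xml:lang equals the corpus language code.

```python
import re
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
ASCII_LETTER = re.compile(r"[A-Za-z]")


def report_ascii(el, corpus_lang, inherited_lang=None, found=None):
    """Collect (tag, text) pairs for elements in the corpus language whose
    own text contains at least one ASCII letter."""
    if found is None:
        found = []
    # xml:lang is inherited from the nearest ancestor that sets it.
    lang = el.get(XML_LANG, inherited_lang)
    if lang == corpus_lang and el.text and ASCII_LETTER.search(el.text):
        found.append((el.tag, el.text.strip()))
    for child in el:
        report_ascii(child, corpus_lang, lang, found)
    return found


# The first orgName starts with a Latin "N" followed by a Greek delta,
# a lookalike mix that is easy to miss by eye.
doc = ET.fromstring(
    '<listOrg xml:lang="el"><org>'
    '<orgName full="abb">N.\u0394.</orgName>'
    '<orgName full="abb" xml:lang="el-Latn">N.D.</orgName>'
    '</org></listOrg>'
)
for tag, text in report_ascii(doc, "el"):
    print(f"ASCII letters in <{tag}>: {text}")
```

The transliterated element is not reported, because its xml:lang is el-Latn rather than the corpus language.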
The script needs to be incorporated either into the pre-processing of the corpora (as factorisation is) or into the parlamint2distro.pl script.
This has now been done and looks like it works: at least GR reported no errors and has the transliterated elements in listOrg and listPerson, e.g.
<orgName full="abb">Ν.Δ.</orgName>
<orgName full="abb" xml:lang="el-Latn">N.D.</orgName>
and
<persName>
<surname>ΝΙΚΟΛΟΠΟΥΛΟΣ</surname>
<forename>ΝΙΚΟΛΑΟΣ</forename>
</persName>
<persName xml:lang="el-Latn">
<surname>Nikolopoylos</surname>
<forename>Nikolaos</forename>
</persName>
So, closing.
BG, GR and UA use non-Latin scripts, but not all of their metadata in listPerson and listOrg is translated into English or transliterated into Latin script. As a result, ParlaMint-en has all the transcriptions in English, but e.g. the names of the speakers remain in Greek or Cyrillic script, which makes the corpora less useful for people who do not speak the languages, and especially for those who are unfamiliar with the script or have no keyboard for typing the names. This issue attempts to resolve this problem.