clarin-eric / ParlaMint

ParlaMint: Comparable Parliamentary Corpora
https://clarin-eric.github.io/ParlaMint/

Prepare taxonomies for translation in countries folders #728

Closed matyaskopp closed 1 year ago

matyaskopp commented 1 year ago

related to #722

@TomazErjavec, I am not sure about this taxonomy ParlaMint-taxonomy-CHES.xml, it is a lot of work to translate it... Should we allow only the English version?

TomazErjavec commented 1 year ago

I am not sure about this taxonomy ParlaMint-taxonomy-CHES.xml, it is a lot of work to translate it... Should we allow only the English version?

Yes, absolutely, I wasn't thinking that we should translate it. Note also that ParlaMint-taxonomy-politicalOrientation.xml is not in fact finished yet (explanations for the more bizarre orientations are missing).

matyaskopp commented 1 year ago

Note also that ParlaMint-taxonomy-politicalOrientation.xml is not in fact finished yet (explanations for the more bizarre orientations are missing).

Should I start with generating taxonomies, or shall I wait for the final ParlaMint-taxonomy-politicalOrientation.xml (#729)?

I can prepare scripts, and the generation can be done later...

TomazErjavec commented 1 year ago

Should I start with generating taxonomies, or shall I wait for the final ParlaMint-taxonomy-politicalOrientation.xml (https://github.com/clarin-eric/ParlaMint/issues/729)?

I now added the explanations (4af3cd5) so I think it is finished. Maybe one more merge...

TomazErjavec commented 1 year ago

@matyaskopp, I think you can close this after one more merge of dev.

TomazErjavec commented 1 year ago

@matyaskopp, as regards parlamint-init-taxonomy.xsl, on reflection I think it is a bad idea to:

  • if the translation for a certain language is missing, an empty translation is produced (the English original is stored in a comment)

because what happens if, as I'm sure will be the case, not all partners translate all the terms? Then we are left with empty terms for the language, but all the scripts will think they actually have a translation, and will put an empty string (e.g. in the vertical or meta-files) for that category. Yes, scripts could be made aware of this danger, but it would mean making them even more complicated and making changes in a large number of places.

So, I'd suggest you just put the English text in the not-yet-translated category descriptions. The partners will notice this, and translate them, if they want to. If not, we are still left with a term and description that are wrongly marked for language, but at least exist.
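The fallback suggested here can be sketched as follows. This is a minimal illustration in Python, not the logic of parlamint-init-taxonomy.xsl: the function name `fill_missing_translations`, the `SAMPLE` fragment, and the language list are assumptions made for the example. It copies the English catDesc for every language that has no catDesc yet, so the copy is marked with the target language but still carries the English text.

```python
import copy
import xml.etree.ElementTree as ET

TEI = "http://www.tei-c.org/ns/1.0"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical input: a taxonomy that so far only has English descriptions.
SAMPLE = """<taxonomy xmlns="http://www.tei-c.org/ns/1.0">
  <category xml:id="chair">
    <catDesc xml:lang="en"><term>Chairperson</term>: chairman of a sitting</catDesc>
  </category>
</taxonomy>"""


def fill_missing_translations(taxonomy, langs):
    """Add an English-text catDesc for each language in `langs` that is missing.

    The copy keeps the English term and description but is labelled with the
    target language, so downstream scripts never see an empty translation.
    """
    for category in taxonomy.iter(f"{{{TEI}}}category"):
        descs = {d.get(XML_LANG): d for d in category.findall(f"{{{TEI}}}catDesc")}
        english = descs.get("en")
        if english is None:
            continue  # nothing to fall back on
        for lang in langs:
            if lang not in descs:
                fallback = copy.deepcopy(english)
                fallback.set(XML_LANG, lang)
                category.append(fallback)
    return taxonomy


root = fill_missing_translations(ET.fromstring(SAMPLE), ["fr", "nl"])
print(ET.tostring(root, encoding="unicode"))
```

A category that already has a catDesc for a language is left untouched, so partner translations are not overwritten.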

TomazErjavec commented 1 year ago

OK, there were a few more changes in the CHES and orientation taxonomies, so they need a dev merge to main. Also, Scripts/parlamint2distro.pl has been changed so that it uses parlamint-init-taxonomy.xsl for inserting common taxonomies into the corpus directories.

TomazErjavec commented 1 year ago

So, I'd suggest you just put the English text in the not-yet-translated category descriptions. The partners will notice this, and translate them, if they want to.

Also, currently, there are funny comments (and a missing ref/@target) in the empty translations; this is for BE, which has "fr nl" as its languages:

   <category xml:id="orientation.SY">
      <catDesc xml:lang="en"><term>Syncretic politics</term>: <ref target="https://en.wikipedia.org/wiki/Syncretic_politics">Syncretic politics</ref> refers to politics that combine elements from across the conventional left–right political spectrum.</catDesc>
      <catDesc xml:lang="fr"><term><!--Syncretic politics--></term>: <!----><ref><!--Syncretic politics--></ref>
         <!-- refers to politics that combine elements from across the conventional left–right political spectrum.-->
      </catDesc>
      <catDesc xml:lang="nl"><term><!--Syncretic politics--></term>: <!----><ref><!--Syncretic politics--></ref>
         <!-- refers to politics that combine elements from across the conventional left–right political spectrum.-->
      </catDesc>
   </category>

For BE there is the other problem that they have fr, nl as their languages, but FR and NL also have the same languages.

So, which translations get chosen, and why? I would propose we just ignore BE translations (i.e. by removing them from the source taxonomies), so we don't get a conflict or, worse, two translations into the same language. Probably not ideal, as the translation could be different in FR/fr and BE/fr (or NL/nl and BE/nl) as the parliamentary systems are different, but it would complicate everything too much to have several translations for the same language.

matyaskopp commented 1 year ago

Also, Scripts/parlamint2distro.pl has been changed so that it uses parlamint-init-taxonomy.xsl for inserting common taxonomies into the corpus directories.

I agree with using this script in parlamint2distro, but I have to add one more parameter to the script that controls how missing translations are treated:

<catDesc xml:lang="fr"><term><!--Syncretic politics--></term>: <ref><!--Syncretic politics--></ref>
         <!-- refers to politics that combine elements from across the conventional left–right political spectrum.-->
      </catDesc>

So, we will use the same script for the preparation of the new ParlaMint-XX corpus (we expect translations) and for the distribution (we expect valid taxonomy). @TomazErjavec do you agree?

For BE there is the other problem that they have fr, nl as their languages, but FR and NL also have the same languages.

So, which translations get chosen, and why? I would propose we just ignore BE translations (i.e. by removing them from the source taxonomies), so we don't get a conflict or, worse, two translations into the same language. Probably not ideal, as the translation could be different in FR/fr and BE/fr (or NL/nl and BE/nl) as the parliamentary systems are different, but it would complicate everything too much to have several translations for the same language.

This should be solved in taxonomy merging: we don't want multiple translations of the same term in our shared common taxonomy.

TomazErjavec commented 1 year ago

So, we will use the same script for the preparation of the new ParlaMint-XX corpus (we expect translations) and for the distribution (we expect valid taxonomy).

Nice! Yes, pls. go ahead and modify.

The next question is how exactly to ask the partners to insert translations. Maybe make a subdirectory in Taxonomies/, where the taxonomies for all partners are deposited, and then they pull request once translated? We also need to let them know they need a synch with repo.

This should be solved in taxonomy merging: we don't want multiple translations of the same term in our shared common taxonomy.

Yes, this will be the best way indeed.

matyaskopp commented 1 year ago

@TomazErjavec, taxonomy extraction is improved:

There is only one thing left (for you); it should be included in the distro script: https://github.com/clarin-eric/ParlaMint/blob/67d03d606fbdb5091429944df9bb1100416586d1/Scripts/parlamint2distro.pl#L350


The next question is how exactly to ask the partners to insert translations. Maybe make a subdirectory in Taxonomies/, where the taxonomies for all partners are deposited, and then they pull request once translated? We also need to let them know they need a synch with repo.

No, the best place for translations is the Sample/ParlaMint-XX directories because they are validated with GitHub action. I can overwrite taxonomies in Samples directories, and the partners will:

If valid, we will then:

TomazErjavec commented 1 year ago

There is only one thing left (for you); it should be included in the distro script

Thank you, done.

The next question is how exactly to ask the partners to insert translations. the best place for translations is the Sample/ParlaMint-XX directories because they are validated with GitHub action.

OK. Do you want to write a mail to them and ask them to do it? Or write a draft and send it to me, and I write, also about the other things (metadata, remaining issues)?

matyaskopp commented 1 year ago

ParlaMint-BE contains Czech translations https://github.com/clarin-eric/ParlaMint/blob/0dfbeb729f258a114e29b82faca9e847ed4c51b6/Corpora/Taxonomies/ParlaMint-taxonomy-NER.ana.xml#L55-L57

I modified the script to use only one translation e9cf0597321ae477b39a60568d329ed17f4553b9.

If parlamint=ParlaMint-XX is used https://github.com/clarin-eric/ParlaMint/blob/18864e7da2163f6bdccd5b50968a9f6b1d0d9366/Scripts/parlamint-init-taxonomy.xsl#L19 https://github.com/clarin-eric/ParlaMint/blob/18864e7da2163f6bdccd5b50968a9f6b1d0d9366/Scripts/parlamint-init-taxonomy.xsl#L49

then n=ParlaMint-XX is preferred. Otherwise, it uses XML-first translation.
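The selection rule described above can be sketched in Python as follows. This is only an illustration of the rule, not the XSLT in the linked commit: it assumes that each catDesc in the merged taxonomy carries an `n` attribute naming the corpus it came from, and the function name `pick_translation`, the `SAMPLE` fragment, and its placeholder wording are made up for the example.

```python
import xml.etree.ElementTree as ET

TEI = "http://www.tei-c.org/ns/1.0"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical merged category with two French descriptions from different corpora.
SAMPLE = """<category xmlns="http://www.tei-c.org/ns/1.0" xml:id="chair">
  <catDesc xml:lang="en"><term>Chairperson</term>: chairman of a sitting</catDesc>
  <catDesc xml:lang="fr" n="ParlaMint-FR"><term>FR wording</term></catDesc>
  <catDesc xml:lang="fr" n="ParlaMint-BE"><term>BE wording</term></catDesc>
</category>"""


def pick_translation(category, lang, parlamint=None):
    """Return one catDesc for `lang`, or None if the language is absent.

    If `parlamint` is given and a catDesc with a matching n attribute exists,
    that one wins; otherwise the first catDesc in document order is used.
    """
    candidates = [
        d for d in category.findall(f"{{{TEI}}}catDesc") if d.get(XML_LANG) == lang
    ]
    if parlamint is not None:
        for desc in candidates:
            if desc.get("n") == parlamint:
                return desc
    return candidates[0] if candidates else None


category = ET.fromstring(SAMPLE)
preferred = pick_translation(category, "fr", parlamint="ParlaMint-BE")
default = pick_translation(category, "fr")
```

With this rule each language yields at most one description per category, which is what keeps BE/fr and FR/fr from both ending up in a corpus taxonomy.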

TomazErjavec commented 1 year ago

ParlaMint-taxonomy-NER.ana.xml contains Czech translations

This was a mess but I think I fixed it now in 02ed791.

TomazErjavec commented 1 year ago

Actually it wasn't but maybe is in dcecff4.

matyaskopp commented 1 year ago

@TomazErjavec, I need some help. What is the source for generating merged taxonomies? There are missing texts in catDesc, e.g. this is the taxonomy released with UA:

<?xml version="1.0" encoding="UTF-8"?>
<taxonomy xmlns="http://www.tei-c.org/ns/1.0" xml:id="ParlaMint-taxonomy-speaker_types" xml:lang="mul">
   <desc xml:lang="uk"><term>Типи промовців</term></desc>
   <desc xml:lang="en"><term>Types of speakers</term></desc>
   <category xml:id="chair">
      <catDesc xml:lang="uk"><term>головуючий</term>: головуючий на засіданні</catDesc>
      <catDesc xml:lang="en"><term>Chairperson</term>: chairman of a sitting</catDesc>
   </category>
   <category xml:id="regular">
      <catDesc xml:lang="uk"><term>регулярний</term>: народний депутат або представник уряду, який бере участь у засіданні</catDesc>
      <catDesc xml:lang="en"><term>Regular</term>: a regular speaker at a sitting</catDesc>
   </category>
   <category xml:id="guest">
      <catDesc xml:lang="uk"><term>гість</term>: промовець на засіданні, який не є народним депутатом або представником уряду</catDesc>
      <catDesc xml:lang="en"><term>Guest</term>: a guest speaker at a sitting</catDesc>
   </category>
</taxonomy>

But the merged taxonomy fragment is: https://github.com/clarin-eric/ParlaMint/blob/dcecff459df5c0bf65e2a1a2c322af81ee0c4d22/Corpora/Taxonomies/ParlaMint-taxonomy-speaker_types.xml#L292-L294

The merging script is implemented differently than I expected, so I need to fully understand how it should be used.

I expected a script for inserting new translations, where the input is a valid taxonomy (only English at the beginning), and it is iteratively extended with new translations. I can implement it if your script simply does not support this.

TomazErjavec commented 1 year ago

@TomazErjavec, I need some help. What is the source for generating merged taxonomies?

I'm sorry that this is such a mess. The source (and it probably shouldn't be) is currently Corpora/Master/ParlaMint.xml and Corpora/Master/ParlaMint.ana.xml or, rather, the files that are XIncluded there (and they, in turn, XInclude the local taxonomies). The corpora themselves are not part of Git, but only on tantra.

The proper source should most likely be Corpora/Sources-TEI, and I should move it there. This directory has its own makefile, where I fiddle with factorisation and adding metadata; anyway, it is rather a mess and I get lost there myself. I should find some time to fix it.

But for this:

There are missing texts in catDesc

thanks for spotting it! It was a completely silly bug, now fixed in c668bf3 (the script had a <xsl:copy-of select="tei:*"/> instead of <xsl:copy-of select="node()"/>, so only child elements were copied and the text nodes were dropped).
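The effect of that one-word difference can be reproduced outside XSLT. The sketch below is an analogy in Python with `xml.dom.minidom`, not the stylesheet itself; the helper name `copy_children` and the `SOURCE` string are assumptions for the example. Keeping only element children corresponds roughly to `select="tei:*"` (the namespace test is omitted here), while keeping all child nodes corresponds to `select="node()"`.

```python
from xml.dom import minidom

# A catDesc with mixed content: a <term> element followed by plain text.
SOURCE = (
    '<catDesc xmlns="http://www.tei-c.org/ns/1.0" xml:lang="en">'
    "<term>Chairperson</term>: chairman of a sitting"
    "</catDesc>"
)


def copy_children(cat_desc, elements_only):
    """Serialise the children that a copy would keep.

    elements_only=True keeps element children only (like select="tei:*");
    elements_only=False keeps every child node (like select="node()").
    """
    return "".join(
        child.toxml()
        for child in cat_desc.childNodes
        if not elements_only or child.nodeType == child.ELEMENT_NODE
    )


cat_desc = minidom.parseString(SOURCE).documentElement
print(copy_children(cat_desc, elements_only=True))   # term only, description lost
print(copy_children(cat_desc, elements_only=False))  # term plus description text
```

Because catDesc is mixed content, any element-only selection silently loses the description after the term, which matches the missing texts reported above.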

matyaskopp commented 1 year ago

@TomazErjavec, I believe we can close this issue, if not, please reopen

TomazErjavec commented 1 year ago

One less, nice!