Closed matyaskopp closed 1 year ago
I am not sure about this taxonomy ParlaMint-taxonomy-CHES.xml, it is a lot of work to translate it... Should we allow only the English version?
Yes, absolutely, I wasn't thinking that we should translate it. Note also that ParlaMint-taxonomy-politicalOrientation.xml is not in fact yet finished (explanations for the more bizarre orientations are missing).
Should I start with generating taxonomies, or shall I wait for the final ParlaMint-taxonomy-politicalOrientation.xml (#729)?
I can prepare scripts, and the generation can be done later...
I now added the explanations (4af3cd5) so I think it is finished. Maybe one more merge...
@matyaskopp, I think you can close this after one more merge of dev.
@matyaskopp, as regards parlamint-init-taxonomy.xsl, on reflection I think it is a bad idea to:
- if translation for certain language missing then empty translation (in comment is stored an english origin)
because what happens if, as I'm sure will be the case, not all partners translate all the terms? Then we are left with empty terms for the language, but all the scripts will think they actually have a translation, and will put an empty string (e.g. in the vertical or meta-files) for that category. Yes, scripts could be made aware of this danger, but it would mean making them even more complicated and making changes in a large number of places.
So, I'd suggest you just put the English text in the not-yet-translated category descriptions. The partners will notice this, and translate them, if they want to. If not, we are still left with a term and description that are wrongly marked for language, but that at least exist.
OK, there were a few more changes in the CHES and orientation taxonomies, so they need a dev merge to main. Also, Scripts/parlamint2distro.pl has been changed so it uses parlamint-init-taxonomy.xsl for inserting common taxonomies into the corpus directories.
Also, currently, there are funny comments (and a missing ref/@target) in the empty translations. This is for BE, which has "fr nl" as the languages:
<category xml:id="orientation.SY">
<catDesc xml:lang="en"><term>Syncretic politics</term>: <ref target="https://en.wikipedia.org/wiki/Syncretic_politics">Syncretic politics</ref> refers to politics that combine elements from across the conventional left–right political spectrum.</catDesc>
<catDesc xml:lang="fr"><term><!--Syncretic politics--></term>: <!----><ref><!--Syncretic politics--></ref>
<!-- refers to politics that combine elements from across the conventional left–right political spectrum.-->
</catDesc>
<catDesc xml:lang="nl"><term><!--Syncretic politics--></term>: <!----><ref><!--Syncretic politics--></ref>
<!-- refers to politics that combine elements from across the conventional left–right political spectrum.-->
</catDesc>
</category>
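The risk described above (empty terms that downstream scripts treat as real translations) could be caught with a small check. The following is only a minimal sketch in Python, not part of the ParlaMint scripts: the function name `find_empty_translations` and the trimmed sample are hypothetical, and it relies on ElementTree discarding XML comments by default, so a term that holds only a comment comes out as empty.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_NS = "http://www.w3.org/XML/1998/namespace"

# Trimmed, hypothetical sample modelled on the BE output quoted in this thread.
SAMPLE = """<taxonomy xmlns="http://www.tei-c.org/ns/1.0">
  <category xml:id="orientation.SY">
    <catDesc xml:lang="en"><term>Syncretic politics</term>: description</catDesc>
    <catDesc xml:lang="fr"><term><!--Syncretic politics--></term>: <ref><!--Syncretic politics--></ref></catDesc>
  </category>
</taxonomy>"""


def find_empty_translations(taxonomy_xml):
    """Return (category id, language) pairs whose <term> has no text."""
    root = ET.fromstring(taxonomy_xml)
    problems = []
    for category in root.iter(f"{{{TEI_NS}}}category"):
        for cat_desc in category.findall(f"{{{TEI_NS}}}catDesc"):
            term = cat_desc.find(f"{{{TEI_NS}}}term")
            # Comments are dropped by the default parser, so a comment-only
            # term contributes no text here.
            text = "".join(term.itertext()).strip() if term is not None else ""
            if not text:
                problems.append(
                    (category.get(f"{{{XML_NS}}}id"), cat_desc.get(f"{{{XML_NS}}}lang"))
                )
    return problems


print(find_empty_translations(SAMPLE))
```

Run against a generated taxonomy, a non-empty result would list the categories that still need translating.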
For BE there is the other problem that they have fr, nl as their languages, but FR and NL also have the same languages.
So, which translations get chosen, and why? I would propose we just ignore BE translations (i.e. by removing them from the source taxonomies), so we don't get a conflict or, worse, two translations into the same language. Probably not ideal, as the translation could be different in FR/fr and BE/fr (or NL/nl and BE/nl) as the parliamentary systems are different, but it would complicate everything too much to have several translations for the same language.
> Also, Scripts/parlamint2distro.pl has been changed so it uses parlamint-init-taxonomy.xsl for inserting common taxonomies into the corpus directories.
I agree with using this script in parlamint2distro, but I have to add one more parameter to the script that says how to treat missing translations:
- if-lang-missing = comment: the current default behaviour. If the translation is missing, the English equivalent is inserted inside a comment. This produces an invalid taxonomy (term should contain non-empty text), but it will force partners to translate it:
<catDesc xml:lang="fr"><term><!--Syncretic politics--></term>: <ref><!--Syncretic politics--></ref>
<!-- refers to politics that combine elements from across the conventional left–right political spectrum.-->
</catDesc>
- if-lang-missing = use-english: the English text is used, and the marker [missing yy translation] is added (probably not as an XML comment, so that it shows that the translation is missing), e.g.:
<catDesc xml:lang="fr"><term>Syncretic politics [missing fr translation]</term>: <ref>Syncretic politics [missing fr translation]</ref>
refers to politics that combine elements from across the conventional left–right political spectrum. [missing fr translation]
</catDesc>
- if-lang-missing = skip: the language is not included at all (this should be the default for taxonomies we are not translating).
So, we will use the same script for the preparation of a new ParlaMint-XX corpus (we expect translations) and for the distribution (we expect a valid taxonomy). @TomazErjavec do you agree?
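The three proposed values of if-lang-missing can be illustrated with a short sketch. This is Python rather than the XSLT of parlamint-init-taxonomy.xsl, and it is deliberately simplified: it models only the term, not the ref or the trailing description, and the function name `build_cat_desc` and the shape of `translations` are assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def build_cat_desc(lang, english_term, translations, if_lang_missing="comment"):
    """Build a <catDesc> for `lang`, or return None when the language is skipped.

    `translations` maps language codes to translated terms (hypothetical shape).
    """
    cat_desc = ET.Element("catDesc", {"xml:lang": lang})
    term = ET.SubElement(cat_desc, "term")
    if lang in translations:
        term.text = translations[lang]
    elif if_lang_missing == "comment":
        # Empty term, English original kept as a comment: invalid on purpose.
        term.append(ET.Comment(english_term))
    elif if_lang_missing == "use-english":
        # Valid output, with a visible marker that the text is untranslated.
        term.text = f"{english_term} [missing {lang} translation]"
    elif if_lang_missing == "skip":
        return None
    else:
        raise ValueError(f"unknown if-lang-missing value: {if_lang_missing}")
    return cat_desc


for mode in ("comment", "use-english", "skip"):
    element = build_cat_desc("fr", "Syncretic politics", {}, mode)
    print(mode, "->", None if element is None else ET.tostring(element, encoding="unicode"))
```

Under these assumptions, only comment yields a term without text, which is what would make validation fail and prompt a translation; use-english and skip both keep the output valid.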
> For BE there is the other problem that they have fr, nl as their languages, but FR and NL also have the same languages.
> So, which translations get chosen, and why? I would propose we just ignore BE translations (i.e. by removing them from the source taxonomies), so we don't get a conflict or, worse, two translations into the same language. Probably not ideal, as the translation could be different in FR/fr and BE/fr (or NL/nl and BE/nl) as the parliamentary systems are different, but it would complicate everything too much to have several translations for the same language.
This should be solved in taxonomy merging: we don't want multiple different translations of the same term in our shared common taxonomy.
> So, we will use the same script for the preparation of the new ParlaMint-XX corpus (we expect translations) and for the distribution (we expect valid taxonomy).
Nice! Yes, pls. go ahead and modify.
The next question is how exactly to ask the partners to insert translations. Maybe make a subdirectory in Taxonomies/, where the taxonomies for all partners are deposited, and then they open a pull request once translated? We also need to let them know they need to sync with the repo.
> This should be solved in taxonomy merging: we don't want multiple different translations of the same term in our shared common taxonomy.
Yes, this will be the best way indeed.
@TomazErjavec, taxonomy extraction is improved. The if-lang-missing values are:
- comment: the English translation is placed in a comment
- use-english: the English text is used instead of the xx language
- skip: if the translation is missing, then desc and catDesc are skipped for the language
There is only one thing left (for you); it should be included in the distro script: https://github.com/clarin-eric/ParlaMint/blob/67d03d606fbdb5091429944df9bb1100416586d1/Scripts/parlamint2distro.pl#L350
> The next question is how exactly to ask the partners to insert translations. Maybe make a subdirectory in Taxonomies/, where the taxonomies for all partners are deposited, and then they open a pull request once translated? We also need to let them know they need to sync with the repo.
No, the best place for translations is the Sample/ParlaMint-XX directories because they are validated with GitHub action. I can overwrite taxonomies in Samples directories, and the partners will:
If valid, we will then:
> There is only one thing left (for you); it should be included in the distro script
Thank you, done.
> The next question is how exactly to ask the partners to insert translations.
> the best place for translations is the Sample/ParlaMint-XX directories because they are validated with GitHub action.
OK. Do you want to write a mail to them and ask them to do it? Or write a draft and send it to me, and I write, also about the other things (metadata, remaining issues)?
ParlaMint-BE contains Czech translations:
https://github.com/clarin-eric/ParlaMint/blob/0dfbeb729f258a114e29b82faca9e847ed4c51b6/Corpora/Taxonomies/ParlaMint-taxonomy-NER.ana.xml#L55-L57
I modified the script to use only one translation (e9cf0597321ae477b39a60568d329ed17f4553b9).
If parlamint=ParlaMint-XX is used
https://github.com/clarin-eric/ParlaMint/blob/18864e7da2163f6bdccd5b50968a9f6b1d0d9366/Scripts/parlamint-init-taxonomy.xsl#L19
https://github.com/clarin-eric/ParlaMint/blob/18864e7da2163f6bdccd5b50968a9f6b1d0d9366/Scripts/parlamint-init-taxonomy.xsl#L49
then the translation with n=ParlaMint-XX is preferred. Otherwise, the script uses the first translation found in the XML.
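The selection rule just described (prefer the translation whose n matches the parlamint parameter, otherwise take the first one in the XML) might be sketched as follows. This is an illustrative Python sketch, not the XSLT in parlamint-init-taxonomy.xsl; the function name `pick_translation`, the (n, text) pair representation, and the sample strings are all hypothetical.

```python
def pick_translation(candidates, parlamint=None):
    """Choose one translation for a language.

    `candidates` is a list of (n, text) pairs in document order, where `n`
    is assumed to name the corpus that supplied the translation.
    """
    if not candidates:
        return None
    if parlamint is not None:
        for n, text in candidates:
            if n == parlamint:
                return text  # the corpus's own translation wins
    return candidates[0][1]  # otherwise fall back to the first in the XML


# Placeholder strings standing in for two French translations of one term.
french = [("ParlaMint-FR", "translation from FR"), ("ParlaMint-BE", "translation from BE")]
print(pick_translation(french, parlamint="ParlaMint-BE"))
print(pick_translation(french))
```

With this rule a corpus such as BE would keep its own fr wording when it exists, and would otherwise inherit whichever fr translation comes first.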
> ParlaMint-taxonomy-NER.ana.xml contains Czech translations
This was a mess but I think I fixed it now in 02ed791.
Actually it wasn't but maybe is in dcecff4.
@TomazErjavec, I need some help. What is the source for generating merged taxonomies?
There are missing texts in catDesc, e.g. this is the taxonomy released with UA:
<?xml version="1.0" encoding="UTF-8"?>
<taxonomy xmlns="http://www.tei-c.org/ns/1.0" xml:id="ParlaMint-taxonomy-speaker_types" xml:lang="mul">
<desc xml:lang="uk"><term>Типи промовців</term></desc>
<desc xml:lang="en"><term>Types of speakers</term></desc>
<category xml:id="chair">
<catDesc xml:lang="uk"><term>головуючий</term>: головуючий на засіданні</catDesc>
<catDesc xml:lang="en"><term>Chairperson</term>: chairman of a sitting</catDesc>
</category>
<category xml:id="regular">
<catDesc xml:lang="uk"><term>регулярний</term>: народний депутат або представник уряду, який бере участь у засіданні</catDesc>
<catDesc xml:lang="en"><term>Regular</term>: a regular speaker at a sitting</catDesc>
</category>
<category xml:id="guest">
<catDesc xml:lang="uk"><term>гість</term>: промовець на засіданні, який не є народним депутатом або представником уряду</catDesc>
<catDesc xml:lang="en"><term>Guest</term>: a guest speaker at a sitting</catDesc>
</category>
</taxonomy>
But the merged taxonomy fragment is: https://github.com/clarin-eric/ParlaMint/blob/dcecff459df5c0bf65e2a1a2c322af81ee0c4d22/Corpora/Taxonomies/ParlaMint-taxonomy-speaker_types.xml#L292-L294
The merging script is implemented differently than I expected, so I need to fully understand how it should be used.
I expected a script for inserting new translations, where the input is a valid taxonomy (only English at the beginning), and it is iteratively extended with new translations. I can implement it if your script does not readily support this.
> @TomazErjavec, I need some help. What is the source for generating merged taxonomies?
I'm sorry that this is such a mess. The source (and it probably shouldn't be) is currently Corpora/Master/ParlaMint.xml and Corpora/Master/ParlaMint.ana.xml or, rather, the files that are XIncluded there (and they, in turn, XInclude the local taxonomies). The corpora themselves are not part of Git, but only on tantra.
The proper source should most likely be Corpora/Sources-TEI, and I should move it there. This directory has its own makefile, where I fiddle with factorisation and adding metadata; anyway, it is rather a mess and I get lost there myself. I should find some time to fix it.
But for this:
> There are missing texts in catDesc
thanks for spotting it! It was a completely silly bug, now fixed in c668bf3 (a <xsl:copy-of select="tei:*"/> instead of <xsl:copy-of select="node()"/>).
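Why this loses text: in XPath, tei:* selects only element children, whereas node() also selects text nodes, so copying tei:* from a mixed-content catDesc keeps the term but drops the description that follows it. The Python sketch below mimics the two selections with xml.dom.minidom. It is an illustration only, not the merging script; the helper name `copy_children` is hypothetical, and namespaces are left out for brevity.

```python
from xml.dom import minidom

SOURCE = '<catDesc xml:lang="en"><term>Chairperson</term>: chairman of a sitting</catDesc>'


def copy_children(xml_text, elements_only):
    """Serialize the root's children, optionally keeping only element nodes."""
    root = minidom.parseString(xml_text).documentElement
    kept = [
        node
        for node in root.childNodes
        if not elements_only or node.nodeType == node.ELEMENT_NODE
    ]
    return "".join(node.toxml() for node in kept)


# Roughly what select="tei:*" copies: the description text is lost.
print(copy_children(SOURCE, elements_only=True))
# Roughly what select="node()" copies: term and description both survive.
print(copy_children(SOURCE, elements_only=False))
```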
@TomazErjavec, I believe we can close this issue, if not, please reopen
One less, nice!
related to #722
Add pruned common taxonomies in the countries' folder:
Taxonomies that need translation:
Taxonomies without change (only English version):
@TomazErjavec, I am not sure about this taxonomy ParlaMint-taxonomy-CHES.xml, it is a lot of work to translate it... Should we allow only the English version?