clarin-eric / ParlaMint

ParlaMint: Comparable Parliamentary Corpora
https://clarin-eric.github.io/ParlaMint/

improve init-taxonomy script #679

Closed matyaskopp closed 1 year ago

matyaskopp commented 1 year ago

Improve script for initializing taxonomies: https://github.com/clarin-eric/ParlaMint/blob/ce00fda7c8210f3cd7a709d8fa77998aac6708b4/Scripts/parlamint-init-taxonomy.xsl#L1-L9

If a translation for a particular term, desc or catDesc exists, it is included in the initialized taxonomy.
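
A minimal Python sketch of that rule, not the XSLT itself: an existing translation is copied into the term, and otherwise the term is left empty with the English text as a comment, which is how the sample output later in this thread looks. The function name `init_descs` and the `translations` dictionary are hypothetical, and the TEI namespace is left out to keep the example short.

```python
import xml.etree.ElementTree as ET


def init_descs(english, translations, langs):
    """Build one <desc> per language code (hypothetical helper).

    english      -- the English term text
    translations -- dict mapping language code to an existing translation
    langs        -- language codes, in the order they should appear
    """
    descs = []
    for lang in langs:
        desc = ET.Element("desc", {"xml:lang": lang})
        term = ET.SubElement(desc, "term")
        if lang == "en":
            term.text = english
        elif lang in translations:
            # An existing translation is carried over unchanged.
            term.text = translations[lang]
        else:
            # No translation yet: keep the English text only as a comment.
            term.append(ET.Comment(english))
        descs.append(desc)
    return descs


for desc in init_descs("Legislature", {"sl": "Zakonodajna oblast"}, ["en", "bg", "sl"]):
    print(ET.tostring(desc, encoding="unicode"))
```

With these inputs the `bg` entry comes out as an empty term holding only the comment, while `en` and `sl` carry text.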

matyaskopp commented 1 year ago

@TomazErjavec ParlaMint/Scripts/parlamint-init-taxonomy.xsl can be used for taxonomy normalization (language order: en first, then the other languages alphabetically) and to make sure everything in the common taxonomy is translated.

The following creates normalized common taxonomies in Data/ParlaMint-TESTTAXONOMY:

mkdir Data/ParlaMint-TESTTAXONOMY
make initTaxonomies-TESTTAXONOMY \
          PARLIAMENTS="TESTTAXONOMY" \
          LANG-CODE-LIST="bg bs ca cs da de el en es es es et eu fi fr fr gl hr hu is it lt lv nl nl no pl pt ro ru sl sr sv tr uk"
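
The language order mentioned above (en first, the other languages alphabetically) can be sketched in Python as follows. This is only an illustration of the ordering rule, not code from the script; the function name `order_langs` is hypothetical. Repeated codes are kept, since the sample output in this thread also repeats es, fr and nl.

```python
def order_langs(codes):
    """Return the codes with 'en' first and the rest in alphabetical order.

    Duplicates are preserved; this sketch only reorders.
    """
    # The key sorts 'en' before everything else (False < True),
    # then falls back to plain alphabetical order.
    return sorted(codes, key=lambda code: (code != "en", code))


print(order_langs("bg cs en es es sl".split()))
```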

If a translation is missing, the resulting taxonomy is invalid, because the affected terms contain no text:

<?xml version="1.0" encoding="UTF-8"?>
<taxonomy xmlns="http://www.tei-c.org/ns/1.0"
          xml:id="ParlaMint-taxonomy-parla.legislature"
          xml:lang="mul">
   <desc xml:lang="en">
      <term>Legislature</term>
   </desc>
   <desc xml:lang="bg">
      <term><!--Legislature--></term>
   </desc>
   <desc xml:lang="bs">
      <term><!--Legislature--></term>
   </desc>
   <desc xml:lang="ca">
      <term><!--Legislature--></term>
   </desc>
   <desc xml:lang="cs">
      <term><!--Legislature--></term>
   </desc>
   <desc xml:lang="da">
      <term><!--Legislature--></term>
   </desc>
   <desc xml:lang="de">
      <term><!--Legislature--></term>
   </desc>
   <desc xml:lang="el">
      <term><!--Legislature--></term>
   </desc>
   <desc xml:lang="es">
      <term><!--Legislature--></term>
   </desc>
   <desc xml:lang="es">
      <term><!--Legislature--></term>
   </desc>
   <desc xml:lang="es">
      <term><!--Legislature--></term>
   </desc>
   <desc xml:lang="et">
      <term><!--Legislature--></term>
   </desc>
   <desc xml:lang="eu">
      <term><!--Legislature--></term>
   </desc>
   <desc xml:lang="fi">
      <term><!--Legislature--></term>
   </desc>
   <desc xml:lang="fr">
      <term><!--Legislature--></term>
   </desc>
   <desc xml:lang="fr">
      <term><!--Legislature--></term>
   </desc>
   <desc xml:lang="gl">
      <term><!--Legislature--></term>
   </desc>
   <desc xml:lang="hr">
      <term><!--Legislature--></term>
   </desc>
   <desc xml:lang="hu">
      <term><!--Legislature--></term>
   </desc>
   <desc xml:lang="is">
      <term><!--Legislature--></term>
   </desc>
   <desc xml:lang="it">
      <term><!--Legislature--></term>
   </desc>
   <desc xml:lang="lt">
      <term><!--Legislature--></term>
   </desc>
   <desc xml:lang="lv">
      <term><!--Legislature--></term>
   </desc>
   <desc xml:lang="nl">
      <term><!--Legislature--></term>
   </desc>
   <desc xml:lang="nl">
      <term><!--Legislature--></term>
   </desc>
   <desc xml:lang="no">
      <term><!--Legislature--></term>
   </desc>
   <desc xml:lang="pl">
      <term><!--Legislature--></term>
   </desc>
   <desc xml:lang="pt">
      <term><!--Legislature--></term>
   </desc>
   <desc xml:lang="ro">
      <term><!--Legislature--></term>
   </desc>
   <desc xml:lang="ru">
      <term><!--Legislature--></term>
   </desc>
   <desc xml:lang="sl">
      <term>Zakonodajna oblast</term>
   </desc>
   <desc xml:lang="sr">
      <term><!--Legislature--></term>
   </desc>
   <desc xml:lang="sv">
      <term><!--Legislature--></term>
   </desc>
   <desc xml:lang="tr">
      <term><!--Legislature--></term>
   </desc>
   <desc xml:lang="uk">
      <term><!--Legislature--></term>
   </desc>
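
A quick way to list the languages that still lack a translation in such a file is sketched below in Python. It assumes a well-formed taxonomy (the sample here is shortened to three languages and closed with `</taxonomy>`), and the function name `untranslated_langs` is hypothetical. It relies on the standard-library parser discarding comments by default, so a term that holds only a comment ends up with no text.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Shortened, well-formed stand-in for the initialized taxonomy.
SAMPLE = """<taxonomy xmlns="http://www.tei-c.org/ns/1.0"
          xml:id="ParlaMint-taxonomy-parla.legislature"
          xml:lang="mul">
   <desc xml:lang="en">
      <term>Legislature</term>
   </desc>
   <desc xml:lang="bg">
      <term><!--Legislature--></term>
   </desc>
   <desc xml:lang="sl">
      <term>Zakonodajna oblast</term>
   </desc>
</taxonomy>
"""


def untranslated_langs(xml_text):
    """Return the xml:lang values of <desc> elements whose term has no text."""
    root = ET.fromstring(xml_text)
    missing = []
    for desc in root.iter(f"{{{TEI_NS}}}desc"):
        term = desc.find(f"{{{TEI_NS}}}term")
        # Fall back to the <desc> itself if it has no <term> child.
        target = term if term is not None else desc
        if not "".join(target.itertext()).strip():
            missing.append(desc.get(XML_LANG))
    return missing


print(untranslated_langs(SAMPLE))
```

On the shortened sample only `bg` is reported, because its term contains nothing but the commented-out English text.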

TomazErjavec commented 1 year ago

Thanks for the explanation. I'm not sure if you are aware (I certainly forgot) that we already have something similar, namely parlamint-merge-taxonomy.

So, I am not quite sure about the usage scenario of one versus the other. However, this can be resolved post 3.0.

matyaskopp commented 1 year ago

Scenario:

The following errors can then arise:

An invalid sample is a better motivation to fix the taxonomy than some error in the log. The merge-init process is repeated until the taxonomy is OK.