clarin-eric / ParlaMint

ParlaMint: Comparable Parliamentary Corpora
https://clarin-eric.github.io/ParlaMint/

Merging of taxonomies #722

Closed TomazErjavec closed 1 year ago

TomazErjavec commented 1 year ago

We want to have common taxonomies (in Corpora/Taxonomies) with all the translations included. The particular corpora should then get their taxonomies from the common ones. The planned workflow is:

  1. Make and run a script to gather all the translations from the taxonomies of the corpora and store them in the common taxonomies; also report missing translations and, especially, corpus-specific categories that are not part of the common taxonomies
  2. Decide whether to add the corpus-specific categories to the common taxonomies, or to correct the corpora (either automatically or ask the partners to do it)
  3. Make a script to take the common taxonomies and write them to the corpus directories, keeping only the English and corpus-language(s) descriptions
  4. Ask the partners to add the missing translations to the taxonomies now stored with their corpus
  5. Re-merge and re-export, with no warnings and especially no errors
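
Steps 1 and 3 above can be sketched roughly as follows. This is only an illustration in Python, not the actual XSLT scripts: the function names `merge_translations` and `export_for_corpus` are hypothetical, the category id `parla.lower` and its terms are made-up sample data, and the elements are assumed to be un-namespaced (real TEI files normally declare the TEI namespace, so tag names would need to be qualified).

```python
import copy
import xml.etree.ElementTree as ET

XML_NS = "{http://www.w3.org/XML/1998/namespace}"
XML_LANG = XML_NS + "lang"
XML_ID = XML_NS + "id"


def merge_translations(common, corpus, corpus_name):
    """Step 1: copy missing catDesc translations from a corpus taxonomy
    into the common taxonomy (modified in place).

    Returns ERROR messages for categories the common taxonomy lacks.
    """
    common_cats = {c.get(XML_ID): c for c in common.iter("category")}
    taxonomy_id = common.get(XML_ID)
    errors = []
    for cat in corpus.iter("category"):
        cat_id = cat.get(XML_ID)
        target = common_cats.get(cat_id)
        if target is None:
            errors.append(
                f"ERROR: {corpus_name} contains non-standard category "
                f"{cat_id} for taxonomy {taxonomy_id}"
            )
            continue
        known = {d.get(XML_LANG) for d in target.findall("catDesc")}
        # Assumes catDesc children precede any nested category elements.
        pos = len(target.findall("catDesc"))
        for desc in cat.findall("catDesc"):
            lang = desc.get(XML_LANG)
            if lang not in known:
                target.insert(pos, copy.deepcopy(desc))
                known.add(lang)
                pos += 1
    return errors


def export_for_corpus(common, keep_langs):
    """Step 3: return a copy of the common taxonomy that keeps only
    the catDesc elements in the given languages."""
    result = copy.deepcopy(common)
    for parent in list(result.iter()):
        for desc in parent.findall("catDesc"):
            if desc.get(XML_LANG) not in keep_langs:
                parent.remove(desc)
    return result


# Hypothetical sample data; only parla.unif is taken from this thread.
common = ET.fromstring(
    '<taxonomy xml:id="ParlaMint-taxonomy-parla.legislature">'
    '<category xml:id="parla.lower">'
    '<catDesc xml:lang="en"><term>Lower house</term></catDesc>'
    "</category></taxonomy>"
)
corpus_is = ET.fromstring(
    '<taxonomy xml:id="ParlaMint-taxonomy-parla.legislature">'
    '<category xml:id="parla.lower">'
    '<catDesc xml:lang="is"><term>(Icelandic term)</term></catDesc>'
    "</category>"
    '<category xml:id="parla.unif">'
    '<catDesc xml:lang="en"><term>Unified Chamber</term></catDesc>'
    "</category></taxonomy>"
)

for message in merge_translations(common, corpus_is, "ParlaMint-IS"):
    print(message)
print(ET.tostring(export_for_corpus(common, {"en", "is"}), encoding="unicode"))
```

Reporting unknown categories instead of adding them silently keeps the decision in step 2 with a human.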

TomazErjavec commented 1 year ago

The parlamint-merge-taxonomy.xsl script is ready, here is how it is currently run: https://github.com/clarin-eric/ParlaMint/blob/2cdddce10e563167f329b8275253625a2860b86f/Corpora/Makefile#L1-L20

TomazErjavec commented 1 year ago

We now need to decide what to do with corpus-specific categories:

ParlaMint-taxonomy-parla.legislature:

https://github.com/clarin-eric/ParlaMint/blob/2cdddce10e563167f329b8275253625a2860b86f/Corpora/Taxonomies/ParlaMint-taxonomy-merge.log#L342-L345

ParlaMint-taxonomy-speaker_types:

https://github.com/clarin-eric/ParlaMint/blob/2cdddce10e563167f329b8275253625a2860b86f/Corpora/Taxonomies/ParlaMint-taxonomy-merge.log#L115-L118

TomazErjavec commented 1 year ago

ERROR: ParlaMint-IS contains non-standard category parla.sittinig for taxonomy ParlaMint-taxonomy-parla.legislature
ERROR: ParlaMint-IT contains non-standard category parla.meetining.public for taxonomy ParlaMint-taxonomy-parla.legislature

sittinig and meetining are typos, and parla.sittinig and parla.meetining.public are in fact not used in IS and IT at all. So it is enough to correct the source taxonomies and re-run the taxonomy-merge script, and the problem will go away. For the next round, IS, IT (and everybody else) should take the merged/split taxonomies as their input anyway.

ERROR: ParlaMint-IS contains non-standard category parla.unif for taxonomy ParlaMint-taxonomy-parla.legislature

This one is defined, on the same level as upper and lower house, as:

<category xml:id="parla.unif">
  <catDesc xml:lang="is"><term>Sameinað þing</term></catDesc>
  <catDesc xml:lang="en"><term>Unified Chamber</term></catDesc>
</category>

However, it is never used in the IS corpus, so I suggest we simply delete it. @starkadur, is this OK with you?
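
A check for whether a category is used at all could look roughly like the sketch below. It is an illustration in Python rather than part of the ParlaMint tooling: the function name `unused_categories` is hypothetical, elements are assumed to be un-namespaced, and categories are assumed to be referenced only through local `ana="#id"` pointers (prefixed or external pointers are not resolved).

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def unused_categories(taxonomy, corpus_roots):
    """Return the ids of categories that no `ana` attribute points to."""
    defined = [c.get(XML_ID) for c in taxonomy.iter("category")]
    used = set()
    for root in corpus_roots:
        for elem in root.iter():
            # `ana` holds whitespace-separated pointers such as "#parla.lower".
            for ref in elem.get("ana", "").split():
                used.add(ref.lstrip("#"))
    return [cat_id for cat_id in defined if cat_id not in used]


# Hypothetical sample data; only parla.unif is taken from this thread.
taxonomy = ET.fromstring(
    "<taxonomy>"
    '<category xml:id="parla.lower"/>'
    '<category xml:id="parla.unif"/>'
    "</taxonomy>"
)
document = ET.fromstring('<text><div ana="#parla.lower #other.tag"/></text>')

print(unused_categories(taxonomy, [document]))
```

Any id this returns is a candidate for deletion from the source taxonomy, subject to confirmation from the corpus maintainers.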

TomazErjavec commented 1 year ago

This is now operational, closing.