clarin-eric / ParlaMint

ParlaMint: Comparable Parliamentary Corpora
https://clarin-eric.github.io/ParlaMint/

FR: special speaker types #800

Closed TomazErjavec closed 1 year ago

TomazErjavec commented 1 year ago

The FR corpus also has the speaker types (i.e. values of u/@ana) "government" and "unknown", but these are not in the common speaker types taxonomy, so processing FR results in errors. What can be done:

  1. Ignore these errors, with the result that the vert and tsv files have no speaker type information for such speakers
  2. Extend the common taxonomy with these two types, but they will be used for FR only and won't have any translations except French
  3. Change these speaker types in the FR corpus to "guest", as government is in fact a guest, and unknown is most likely to be one as well

Personally I am in favour of 3, as there is little time left until V4, and we can do something better in the future. Thoughts?
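For illustration, option 3 could be sketched roughly as below. This is a minimal sketch and not the script actually used on the corpus: it assumes that the speaker types appear in u/@ana as pointers such as "#government" and "#unknown", that the files are in the TEI namespace, and that @ana may hold several space-separated pointers. The function name `remap_speaker_types` and the `REMAP` table are hypothetical, and the mapping itself is only the one proposed in option 3.

```python
import xml.etree.ElementTree as ET

# Assumption: the corpus files use the standard TEI namespace.
TEI_NS = "http://www.tei-c.org/ns/1.0"

# Hypothetical mapping for option 3; adjust if another target type is agreed on.
REMAP = {"#government": "#guest", "#unknown": "#guest"}


def remap_speaker_types(root, remap=REMAP):
    """Rewrite u/@ana pointers in place; return the number of <u> elements changed."""
    changed = 0
    for u in root.iter(f"{{{TEI_NS}}}u"):
        ana = u.get("ana")
        if ana is None:
            continue
        # @ana may contain several space-separated pointers; remap each one.
        old_parts = ana.split()
        new_parts = [remap.get(part, part) for part in old_parts]
        if new_parts != old_parts:
            u.set("ana", " ".join(new_parts))
            changed += 1
    return changed


# Tiny made-up sample, not taken from the FR corpus.
sample = (
    f'<TEI xmlns="{TEI_NS}"><text><body><div>'
    '<u who="#A" ana="#government"/>'
    '<u who="#B" ana="#unknown"/>'
    '<u who="#C" ana="#chair"/>'
    "</div></body></text></TEI>"
)
root = ET.fromstring(sample)
n = remap_speaker_types(root)
```

Keeping the mapping in one table means a different target type only needs a change to `REMAP`, not to the traversal.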

PS: Note that the term for "unknown" is currently wrong anyway:

<category xml:id="government">
   <catDesc xml:lang="fr"><term>Gouvernement</term> : membre du gouvernement invité en séance</catDesc>
   <catDesc xml:lang="en"><term>Government</term>: member of government invited at a meeting</catDesc>
</category>
<category xml:id="unknown">
   <catDesc xml:lang="fr"><term>Gouvernement</term> : type d'orateur inconnu</catDesc>
   <catDesc xml:lang="en"><term>Government</term>: : unknown type of speaker</catDesc>
</category>

matyaskopp commented 1 year ago

3. Change these speaker types in the FR corpus to "guest", as government is in fact guest, and unknown most likely to be so

No government members/representants should be regular. From the documentation:

Note that we used the #regular values not only for MPs but for all other speakers that can regularly speak in a parliament, e.g. ministers, the MP, members of parliamentary commissions etc.

so I suggest changing the following:

in future we can add a subcategory of regular: government

TomazErjavec commented 1 year ago

No government members/representants should be regular.

I assume the above is a typo and "No" should be deleted.

suggest changing the following:

OK, done.

in future we can add a subcategory of regular: government

And a few others too, like interrupting or vice-chair. Closing this; I will open another issue if and when we start looking at these categories again.