clarin-eric / ParlaMint

ParlaMint: Comparable Parliamentary Corpora
https://clarin-eric.github.io/ParlaMint/

Localization - vertical files #794

Open matyaskopp opened 1 year ago

matyaskopp commented 1 year ago

I still need help reaching agreement on the localization of vertical files. Here is a sample from ParlaMint-BE:

<speech id="ParlaMint-BE_2022-06-09-voorlopig-55-plenair-ip185x.u1" 
        text_id="ParlaMint-BE_2022-06-09-voorlopig-55-plenair-ip185x" 
        subcorpus="War" 
        lang="Multilingual" 
        body="Eerste Kamer" 
        term="55" 
        session="-" 
        meeting="ip185" 
        sitting="-" 
        agenda="-" 
        date="2022-06-09" 
        title="Belgisch parlementair corpus ParlaMint-BE, plenaire zitting van 09-06-2022" 
        speaker_role="Voorzitter" 
        speaker_id="TillieuxEliane" 
        speaker_name="Tillieux, Eliane" 
        speaker_mp="MP" 
        speaker_minister="notMinister" 
        speaker_party="PS" 
        speaker_party_name="Parti Socialiste" 
        party_status="Coalition" 
        party_orientation="Centre-left to left" 
        speaker_gender="F" 
        speaker_birth="1966">
...

I can see multiple problems:

  1. The corpus is partially translated so that the query will contain mixed languages en/nl in values
  2. The corpus is multilingual (fr/nl), so the user can expect French in values
  3. If someone decides to improve the translations (use a different term / add a missing translation) in future releases (ParlaMint 4/5 ??), then old queries will not work
  4. What is the plan for all-in-one (ParlaMint-XX) in NoSketch Engine? Will we use the English values?

TomazErjavec commented 1 year ago

I still need help reaching agreement on the localization of vertical files.

In short: it isn't perfect, but it is the first step. I think that, for ideopolitical reasons if nothing else, researchers in country XX looking at the parliament of XX deserve to have the metadata in their native language. And given that we have most of the metadata in both en and xx, why not display it in xx?

That said:

I can see multiple problems:

1. The corpus is partially translated so that the query will contain mixed languages en/nl in values

True, but most is translated (or at least should be, depending on the partner); I think everything except for "Multilingual", "MP", "minister" and "F".

2. The corpus is multilingual (fr/nl), so the user can expect French in values

Yes, this is a limitation, I agree. Then again, at least for the concordancers, we probably wouldn't want to have two corpora for some countries that differ only in the language of the metadata; or, maybe even worse, all the metadata available in two languages as separate attributes. That would get messy.

3. If someone decides to improve the translations (use a different term / add a missing translation) in future releases (ParlaMint 4/5 ??), then old queries will not work

If somebody changes the English term, they won't work either. Anyway, the idea that you can change only the version of the corpus and everything else keeps working doesn't hold now either: attributes changed between 2.1 and 3.0, and again between 3.0 and 4.0.
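
To make point 3 concrete, here is a minimal Python sketch, not the actual ParlaMint or concordancer tooling. It reads the attributes of a `<speech ...>` structure tag (shortened from the ParlaMint-BE sample above) and filters on an exact attribute value. The helper names `parse_structure_attrs` and `select` are made up for this example, and the value "Voorzitster" is an invented stand-in for a revised translation in some later release.

```python
import re

# Attribute syntax of a structure tag in a vertical file: name="value"
ATTR_RE = re.compile(r'([\w-]+)="([^"]*)"')


def parse_structure_attrs(tag: str) -> dict:
    """Return the attributes of a structure tag such as <speech ...> as a dict."""
    return dict(ATTR_RE.findall(tag))


def select(speeches, **conditions):
    """Keep speeches whose attributes equal all given values (exact string match)."""
    return [
        s for s in speeches
        if all(s.get(name) == value for name, value in conditions.items())
    ]


# Shortened header taken from the ParlaMint-BE sample in this issue.
current = parse_structure_attrs(
    '<speech id="ParlaMint-BE_2022-06-09-voorlopig-55-plenair-ip185x.u1" '
    'speaker_role="Voorzitter" speaker_party="PS">'
)

# HYPOTHETICAL later release in which the translation of the role was changed.
# "Voorzitster" is an invented value, used only for illustration.
revised = dict(current, speaker_role="Voorzitster")

# The same query matches the current release but not the revised one.
print(len(select([current], speaker_role="Voorzitter")))
print(len(select([revised], speaker_role="Voorzitter")))
```

The filter is a plain string comparison, so it breaks in exactly the same way whether the value that changed was the localized term or the English one.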

4. What is the plan for all-in-one (ParlaMint-XX) in NoSketch Engine? Will we use the English values?

Yes.