clarin-eric / ParlaMint

ParlaMint: Comparable Parliamentary Corpora
https://clarin-eric.github.io/ParlaMint/

PL feedback #573

Closed matyaskopp closed 1 year ago

matyaskopp commented 1 year ago

very few government/parliament members

Does your sample contain all persons, or only those who appear in the sample files? You have only 16 government affiliations, which I guess correspond to a single term in Poland. The same holds for parliament members (218) and the Senat (7).

wrong timespan in title, bibl, and setting

We expect content up to 2022-06. https://github.com/mrudolf/ParlaMint/blob/76ce1e341db4d231e3ffbd2ac76c1767b7fde8cf/Data/ParlaMint-PL/ParlaMint-PL.xml#L8-L9

        <title type="sub" xml:lang="pl">Sprawozdania stenograficzne Sejmu i Senatu RP (2015-2020)</title>
        <title type="sub" xml:lang="en">Minutes of the Sejm and Senat of the Republic of Poland (2015-2020)</title>

https://github.com/mrudolf/ParlaMint/blob/76ce1e341db4d231e3ffbd2ac76c1767b7fde8cf/Data/ParlaMint-PL/ParlaMint-PL.xml#L71

          <date from="2015-11-12" to="2020-08-14">12.11.2015 - 14.08.2020</date>

https://github.com/mrudolf/ParlaMint/blob/76ce1e341db4d231e3ffbd2ac76c1767b7fde8cf/Data/ParlaMint-PL/ParlaMint-PL.xml#L430

      <settingDesc>
        <setting>
          <name type="address">ul. Wiejska 4/6/8</name>
          <name type="city">Warszawa</name>
          <name type="country" key="PL">Poland</name>
          <date from="2015-11-06" to="2020-08-18">12.11.2015 - 14.8.2020</date>
        </setting>
      </settingDesc>
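A quick consistency check can catch this kind of mismatch before submission. The following is only a minimal sketch: the `SAMPLE` string is a hypothetical, namespace-free excerpt modelled on the snippets above, and `text_to_iso` and `timespan_problems` are illustrative helper names, not part of any ParlaMint tooling. Real corpus files use the TEI namespace, so the element lookups would need adjusting.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, namespace-free excerpt modelled on the snippets quoted above.
SAMPLE = """<teiHeader>
  <bibl>
    <date from="2015-11-12" to="2020-08-14">12.11.2015 - 14.08.2020</date>
  </bibl>
  <settingDesc>
    <setting>
      <date from="2015-11-06" to="2020-08-18">12.11.2015 - 14.8.2020</date>
    </setting>
  </settingDesc>
</teiHeader>"""

# Matches a human-readable span such as "12.11.2015 - 14.8.2020".
TEXT_SPAN = re.compile(
    r"(\d{1,2})\.(\d{1,2})\.(\d{4})\s*-\s*(\d{1,2})\.(\d{1,2})\.(\d{4})"
)


def text_to_iso(text):
    """Convert 'D.M.YYYY - D.M.YYYY' into an ISO (from, to) pair, or None."""
    match = TEXT_SPAN.search(text or "")
    if match is None:
        return None
    d1, m1, y1, d2, m2, y2 = (int(g) for g in match.groups())
    return (f"{y1:04d}-{m1:02d}-{d1:02d}", f"{y2:04d}-{m2:02d}-{d2:02d}")


def timespan_problems(root):
    """List disagreements between date attributes, date text, and other dates."""
    problems = []
    spans = set()
    for date in root.iter("date"):
        if "from" not in date.attrib or "to" not in date.attrib:
            continue
        attrs = (date.get("from"), date.get("to"))
        spans.add(attrs)
        shown = text_to_iso(date.text)
        if shown is not None and shown != attrs:
            problems.append(f"attributes {attrs} disagree with text {shown}")
    if len(spans) > 1:
        problems.append(f"{len(spans)} different timespans in one header")
    return problems


problems = timespan_problems(ET.fromstring(SAMPLE))
for problem in problems:
    print(problem)
```

The check compares only what the header says about itself; whether the span reaches the expected 2022-06 cutoff still has to be verified separately.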

missing terms events in parliament organizations

https://github.com/mrudolf/ParlaMint/blob/76ce1e341db4d231e3ffbd2ac76c1767b7fde8cf/Data/ParlaMint-PL/ParlaMint-PL-listOrg.xml#L7-L12

  <org xml:id="parliament.Sejm" role="parliament" ana="#parla.national #parla.lower">
    <orgName full="yes" xml:lang="pl">Sejm</orgName>
  </org>
  <org xml:id="parliament.Senat" role="parliament" ana="#parla.national #parla.upper">
    <orgName full="yes" xml:lang="pl">Senat</orgName>
  </org>
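Organisations without term information can be listed automatically. This sketch assumes that terms are encoded as TEI `event` elements somewhere inside the `org`; the second organisation in `SAMPLE` carries a made-up placeholder event purely for contrast, and `parliaments_without_terms` is a hypothetical helper name. The excerpt is namespace-free, unlike the real files.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the reserved XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical excerpt: the Sejm entry mirrors the snippet above, while the
# event under Senat is a placeholder invented for this illustration only.
SAMPLE = """<listOrg>
  <org xml:id="parliament.Sejm" role="parliament" ana="#parla.national #parla.lower">
    <orgName full="yes" xml:lang="pl">Sejm</orgName>
  </org>
  <org xml:id="parliament.Senat" role="parliament" ana="#parla.national #parla.upper">
    <orgName full="yes" xml:lang="pl">Senat</orgName>
    <listEvent>
      <event xml:id="placeholder.term" from="2015-11-12">
        <label xml:lang="en">placeholder term entry</label>
      </event>
    </listEvent>
  </org>
</listOrg>"""


def parliaments_without_terms(root):
    """Return the xml:id of every parliament org that contains no event."""
    return [
        org.get(XML_ID)
        for org in root.iter("org")
        if org.get("role") == "parliament" and org.find(".//event") is None
    ]


missing = parliaments_without_terms(ET.fromstring(SAMPLE))
print(missing)
```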

proper source link (nice to have)

The proper link to your sample file is: https://www.sejm.gov.pl/sejm8.nsf/wypowiedz.xsp?posiedzenie=5&dzien=2&wyp=0

https://github.com/mrudolf/ParlaMint/blob/76ce1e341db4d231e3ffbd2ac76c1767b7fde8cf/Data/ParlaMint-PL/ParlaMint-PL_2015-12-16-sejm-05-2.xml#L70

missing speaker notes

https://github.com/mrudolf/ParlaMint/blob/76ce1e341db4d231e3ffbd2ac76c1767b7fde8cf/Data/ParlaMint-PL/ParlaMint-PL_2015-12-16-sejm-05-2.xml#L119

        <note type="speaker">Marszałek:</note> <!-- these note types are missing in your data -->
        <u who="#MarekKuchciński" ana="#chair" xml:id="ParlaMint-PL_2015-12-16-sejm-05-2.u1">
          <seg xml:id="seg1">Wznawiam posiedzenie.</seg>
          <seg xml:id="seg2">Na sekretarzy dzisiejszych obrad powołuję posłów Krzysztofa Kubowa oraz Artura Sobonia.</seg>
          <seg xml:id="seg3">Protokół i listę mówców prowadzić będzie pan poseł Artur Soboń.</seg>
          <seg xml:id="seg4">Proszę pana posła sekretarza o odczytanie komunikatów.</seg>
        </u>


strange note annotation

https://github.com/mrudolf/ParlaMint/blob/76ce1e341db4d231e3ffbd2ac76c1767b7fde8cf/Data/ParlaMint-PL/ParlaMint-PL_2015-12-16-sejm-05-2.xml#L157

<note type="debate">Chwila ciszy</note>

speech in ()

https://github.com/mrudolf/ParlaMint/blob/76ce1e341db4d231e3ffbd2ac76c1767b7fde8cf/Data/ParlaMint-PL/ParlaMint-PL_2015-12-16-sejm-05-2.xml#L161

        <u who="#AndrzejJaworski" ana="#guest" xml:id="ParlaMint-PL_2015-12-16-sejm-05-2.u6">
          <seg xml:id="seg25">(Dziękuję.)</seg>
        </u>
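Segments of this kind are easy to locate across a file. The sketch below only finds them; how each one should be re-encoded is a separate decision that this code does not make. `parenthesised_segs` is a hypothetical helper name, and the sample is namespace-free, unlike the real corpus files.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the reserved XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# The utterance quoted above, without the TEI namespace.
SAMPLE = """<u who="#AndrzejJaworski" ana="#guest" xml:id="ParlaMint-PL_2015-12-16-sejm-05-2.u6">
  <seg xml:id="seg25">(Dziękuję.)</seg>
</u>"""


def parenthesised_segs(root):
    """Return (xml:id, text) for every seg wrapped entirely in parentheses."""
    hits = []
    for seg in root.iter("seg"):
        text = "".join(seg.itertext()).strip()
        if text.startswith("(") and text.endswith(")"):
            hits.append((seg.get(XML_ID), text))
    return hits


hits = parenthesised_segs(ET.fromstring(SAMPLE))
print(hits)
```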

speech with interruptions

There are a lot of speeches of this kind in your sample:

https://github.com/mrudolf/ParlaMint/blob/76ce1e341db4d231e3ffbd2ac76c1767b7fde8cf/Data/ParlaMint-PL/ParlaMint-PL_2015-12-16-sejm-05-2.xml#L478-L500

        <u who="#BorysBudka" ana="#regular" xml:id="ParlaMint-PL_2015-12-16-sejm-05-2.u56">
          <seg xml:id="seg113">Poseł Borys Budka:</seg>
          <seg xml:id="seg114">Tymczasem pan marszałek po raz kolejny dopuszcza do rzeczy karygodnej.</seg>
          <vocal type="speaking">
            <desc>Siadaj!</desc>
          </vocal>
          <seg xml:id="seg115">Wprowadza do porządku obrad projekt ustawy...</seg>
          <seg xml:id="seg116">Marszałek:</seg>
          <seg xml:id="seg117">Panie pośle, zwracam panu uwagę, że pan nie występuje w trybie wniosku formalnego.</seg>
          <seg xml:id="seg118">Poseł Borys Budka:</seg>
          <seg xml:id="seg119">...o zmianie ustawy o Trybunale Konstytucyjnym, którego do teraz nie ma w drukach sejmowych.</seg>
          <seg xml:id="seg120">Marszałek:</seg>
          <seg xml:id="seg121">Do pan posła Borysa Budki. Zwracam panu uwagę, że zakłóca pan obrady Sejmu w tej chwili.</seg>
          <seg xml:id="seg122">Poseł Borys Budka:</seg>
          <seg xml:id="seg123">Ile razy jeszcze pan marszałek złamie regulamin Sejmu?</seg>
          <seg xml:id="seg124">Marszałek:</seg>
          <seg xml:id="seg125">Panie pośle, proszę pana, żeby opuścił pan mównicę.</seg>
          <seg xml:id="seg126">Poseł Borys Budka:</seg>
          <seg xml:id="seg127">Ile razy jeszcze państwo to złamiecie? Bardzo proszę o odpowiedź na to pytanie.</seg>
          <kinesic type="applause">
            <desc>Oklaski</desc>
          </kinesic>
        </u>

should be encoded this way:

<note type="speaker">Poseł Borys Budka:</note>
<u who="#BorysBudka" ana="#regular" xml:id="ParlaMint-PL_2015-12-16-sejm-05-2.u56">
  <seg xml:id="seg114">Tymczasem pan marszałek po raz kolejny dopuszcza do rzeczy karygodnej.</seg>
  <vocal type="speaking">
    <desc>Siadaj!</desc>
  </vocal>
  <seg xml:id="seg115">Wprowadza do porządku obrad projekt ustawy...</seg>
</u>
<note type="speaker">Marszałek:</note>
<u who="..." ana="#chair" ... >
  <seg xml:id="seg117">Panie pośle, zwracam panu uwagę, że pan nie występuje w trybie wniosku formalnego.</seg>
</u>
<note type="speaker">Poseł Borys Budka:</note>
<u who="#BorysBudka" ana="#regular" ...>
  <seg xml:id="seg119">...o zmianie ustawy o Trybunale Konstytucyjnym, którego do teraz nie ma w drukach sejmowych.</seg>
</u>
<!-- skipping rest -->
<kinesic type="applause"> <!-- moving trailing notes outside utterance -->
  <desc>Oklaski</desc>
</kinesic>
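The restructuring shown above could be automated roughly as follows. This is a sketch under stated assumptions, not the conversion used for the corpus: `split_utterance` and the `SPEAKERS` lookup are hypothetical, the mapping of "Marszałek:" to `#MarekKuchciński` is taken from the first utterance of the same file and may not hold for every sitting, and new `xml:id` values for the split utterances are deliberately not assigned.

```python
import copy
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Shortened, namespace-free version of the utterance quoted above.
SAMPLE = """<u who="#BorysBudka" ana="#regular" xml:id="ParlaMint-PL_2015-12-16-sejm-05-2.u56">
  <seg xml:id="seg113">Poseł Borys Budka:</seg>
  <seg xml:id="seg114">Tymczasem pan marszałek po raz kolejny dopuszcza do rzeczy karygodnej.</seg>
  <vocal type="speaking"><desc>Siadaj!</desc></vocal>
  <seg xml:id="seg115">Wprowadza do porządku obrad projekt ustawy...</seg>
  <seg xml:id="seg116">Marszałek:</seg>
  <seg xml:id="seg117">Panie pośle, zwracam panu uwagę, że pan nie występuje w trybie wniosku formalnego.</seg>
  <seg xml:id="seg118">Poseł Borys Budka:</seg>
  <seg xml:id="seg119">...o zmianie ustawy o Trybunale Konstytucyjnym, którego do teraz nie ma w drukach sejmowych.</seg>
  <kinesic type="applause"><desc>Oklaski</desc></kinesic>
</u>"""

# Hypothetical lookup from speaker label to (who, ana); assumed, not from the corpus.
SPEAKERS = {
    "Poseł Borys Budka:": ("#BorysBudka", "#regular"),
    "Marszałek:": ("#MarekKuchciński", "#chair"),
}


def split_utterance(u, speakers):
    """Split one <u> into speaker notes and utterances at known speaker labels."""
    out = []
    current = None
    for child in u:
        label = (child.text or "").strip()
        if child.tag == "seg" and label in speakers:
            # A known label becomes a speaker note and opens a new utterance.
            who, ana = speakers[label]
            note = ET.Element("note", {"type": "speaker"})
            note.text = label
            current = ET.Element("u", {"who": who, "ana": ana})
            out.extend([note, current])
            continue
        if current is None:
            # Content before the first label stays with the original speaker.
            attrs = {k: v for k, v in u.attrib.items() if k != XML_ID}
            current = ET.Element("u", attrs)
            out.append(current)
        current.append(copy.deepcopy(child))
    if current is not None:
        # Move trailing non-speech elements out of the last utterance.
        trailing = []
        while len(current) and current[-1].tag != "seg":
            trailing.insert(0, current[-1])
            current.remove(current[-1])
        out.extend(trailing)
    return out


parts = split_utterance(ET.fromstring(SAMPLE), SPEAKERS)
for part in parts:
    print(ET.tostring(part, encoding="unicode").strip())
```

Matching against a list of known labels, rather than treating any segment that ends in a colon as a label, avoids splitting at ordinary speech that happens to end with a colon.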

huge amount of L2 syntax errors

This error can be fixed by replacing such a relation with root.

mrudolf commented 1 year ago

I am not sure how I am supposed to mark the issues listed here as done when they are all in one GitHub issue. So I will fix them gradually and let you decide when to close it.

A few comments:

TomazErjavec commented 1 year ago
* very few government/parliament members – the sample contains only the persons from the sample.

Actually, it should contain all the persons (and orgs), even if they are not used in the sample. In this way we can Git-validate at least the root files completely, and the GitHub samples are good for computing statistics on the whole-corpus metadata. This was the same in V1.

* wrong timespan – is 2022-06 the cutoff or should I add later sessions if I have them?

No, they should stop at cutoff.

* speech with interruptions – we are having all the sessions proofread. This should fix most of such problems, but I expect this to be available only by the end of January. Is that OK?

That is tight but maybe doable. Why don't we continue debugging and, when all is OK with the samples, submit the complete PL; once that validates, we can see whether the corrected version becomes available in time.

* huge amount of L2 syntax errors – I get the analysis from external tool which I am not able to modify. Is the whole analysis wrong or can this be solved by some kind of automatic changes?

Matyáš already answered this:

This error can be fixed by replacing such a relation with root.

matyaskopp commented 1 year ago

@mrudolf Nice work, thanks. There is only one issue remaining: setting has wrong timespan: https://github.com/mrudolf/ParlaMint/blob/fc88bc78892680c2c182339b7459656182729199/Data/ParlaMint-PL/ParlaMint-PL.xml#L430

        <setting>
          <name type="address">ul. Wiejska 4/6/8</name>
          <name type="city">Warszawa</name>
          <name type="country" key="PL">Poland</name>
          <date from="2015-11-06" to="2020-08-18">12.11.2015 - 14.8.2020</date>
        </setting>

mrudolf commented 1 year ago

Thanks, I was sure I had fixed that, but apparently I hadn't.

I would do it and then, once I reprocess a few files, should I submit the whole corpus to Tomaž?


— Michał Rudolf

matyaskopp commented 1 year ago

I would do it and then, once I reprocess a few files, should I submit the whole corpus to Tomaž?

exactly

TomazErjavec commented 1 year ago

& precisely :) looking forward to it.