clarin-eric / ParlaMint

ParlaMint: Comparable Parliamentary Corpora
https://clarin-eric.github.io/ParlaMint/

LT Feedback #620

Closed matyaskopp closed 1 year ago

matyaskopp commented 1 year ago

Unique main title

Title should be unique in the corpus https://clarin-eric.github.io/ParlaMint/#exa-titleStmtComp

In the example it can be seen that the main title of a corpus component is simply an extension of the corpus root title, as it also gives the name of the particular meeting that the component contains, while the subordinate title is, again, free text. Both titles must be unique in the complete corpus.

https://github.com/mindpetk/ParlaMint/blob/5bff893e47533c3e9543d13f1f35a380ef3776d2/Data/ParlaMint-LT/ParlaMint-LT_1993-01-04-seimas-1-1.xml#L5-L6

      <titleStmt> <title type="main" xml:lang="lt">Lietuvos parlamento debatų tekstynas ParlaMint-LT, 1 eilinė sesija [ParlaMint]</title>
        <title type="main" xml:lang="en">Lithuanian parliamentary corpus ParlaMint-LT, Non Regular [ParlaMint]</title>
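A check like this can be automated. The following is only a minimal sketch, not part of the ParlaMint tooling: the function names `main_titles` and `duplicated_titles` are made up for illustration, and it assumes each component file has been read into a string and uses the TEI namespace.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

TEI = "{http://www.tei-c.org/ns/1.0}"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def main_titles(xml_text):
    """Yield (lang, title) for each <title type="main"> in the first titleStmt."""
    root = ET.fromstring(xml_text)
    stmt = root.find(f".//{TEI}titleStmt")
    if stmt is None:
        return
    for title in stmt.findall(f"{TEI}title[@type='main']"):
        # normalise whitespace so that line breaks do not hide duplicates
        yield title.get(XML_LANG), " ".join("".join(title.itertext()).split())


def duplicated_titles(components):
    """components maps file name -> XML text (hypothetical input format).

    Returns {(lang, title): [file names]} for titles used more than once.
    """
    seen = defaultdict(list)
    for name, xml_text in components.items():
        for key in main_titles(xml_text):
            seen[key].append(name)
    return {key: files for key, files in seen.items() if len(files) > 1}
```

An empty result means every main title is unique per language; anything else lists the files that share a title.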

wrong corpus timespan

Corpus timespan in title:

<title type="sub" xml:lang="en">Transcripts of the Meetings of the Seimas of the Republic of Lithuania (1992-2022)</title>

bibl: https://github.com/mindpetk/ParlaMint/blob/5bff893e47533c3e9543d13f1f35a380ef3776d2/Data/ParlaMint-LT/ParlaMint-LT.xml#L66

            <bibl>
               <title type="main" xml:lang="lt">Lietuvos Respublikos Seimo posėdžių stenogramos</title>
               <title type="main" xml:lang="en">Transcripts of the Meetings of the Seimas of the Republic of Lithuania</title>
               <publisher>Lietuvos Seimo kanceliarija</publisher>
               <idno type="URI">https://www.lrs.lt/sip/portal.show?p_r=35727</idno>
               <date from="2012-11-16" to="2020-08-14">16.11.2012 - 14.08.2020</date>
            </bibl>

setting: https://github.com/mindpetk/ParlaMint/blob/5bff893e47533c3e9543d13f1f35a380ef3776d2/Data/ParlaMint-LT/ParlaMint-LT.xml#L127

            <setting>
               <name type="address">Gedimino pr. 53</name>
               <name type="city">Vilnius</name>
               <name key="LT" type="country">Lithuania</name>
               <date from="2012-11-16" to="2020-11-10">16.11.2012 - 10.11.2020</date>
            </setting>

missing current governments

From the data, it seems that the last government ended on 2020-12-11, and a new one was not established. https://github.com/mindpetk/ParlaMint/blob/5bff893e47533c3e9543d13f1f35a380ef3776d2/Data/ParlaMint-LT/ParlaMint-LT-listOrg.xml#L70-L75

   <org role="government" xml:id="government.LT">
      <orgName full="yes" xml:lang="lt">Lietuvos Respublikos vyriausybė</orgName>
      <orgName full="yes" xml:lang="en">Government of the Lithuanian Republic</orgName>
      <orgName full="abb">Vyriausybė</orgName>
      <listEvent>
<!-- ... -->
         <event from="2016-11-22" to="2020-12-11" xml:id="government.LT.17">
            <label xml:lang="lt">17 Lietuvos Respublikos vyriausybė (2016-11-22 - 2020-12-11)</label>
            <label xml:lang="en">17 Government of the Lithuanian Republic (2016-11-22 - 2020-12-11)</label>
         </event>
      </listEvent>
   </org>

LT has a unicameral system

https://github.com/mindpetk/ParlaMint/blob/5bff893e47533c3e9543d13f1f35a380ef3776d2/Data/ParlaMint-LT/ParlaMint-LT-listOrg.xml#L76

   <org ana="#parla.national #parla.lower" role="parliament" xml:id="S">

should be

   <org ana="#parla.national #parla.uni" role="parliament" xml:id="S">

I believe

to date in current term

Is it possible to have an early election in Lithuania? If yes, then the @to attribute should be removed: https://github.com/mindpetk/ParlaMint/blob/5bff893e47533c3e9543d13f1f35a380ef3776d2/Data/ParlaMint-LT/ParlaMint-LT-listOrg.xml#L120

         <event from="2020-11-13" to="2024-11-14" xml:id="S.9">

otherwise you can leave it as it is

to date in coalition/opposition

I suggest removing the @to date from the current coalition and opposition relations, because they may change in the future: https://github.com/mindpetk/ParlaMint/blob/5bff893e47533c3e9543d13f1f35a380ef3776d2/Data/ParlaMint-LT/ParlaMint-LT-listOrg.xml#L1747-L1757

      <relation ana="#S.9"
                from="2020-11-13"
                mutual="#parliamentaryGroup.LF.1154 #parliamentaryGroup.LSF.8700 #parliamentaryGroup.TS_LKDF.1022"
                name="coalition"
                to="2024-11-14"/>
      <relation active="#parliamentaryGroup.DFVL.1322 #parliamentaryGroup.DPF.874 #parliamentaryGroup.LLRA_KSSF.1051 #parliamentaryGroup.LRF.577 #parliamentaryGroup.LSDDF.1098 #parliamentaryGroup.LSDPF.19928 #parliamentaryGroup.LVZSF.1070 #parliamentaryGroup.MSNG.20080"
                ana="#S.9"
                from="2020-11-13"
                name="opposition"
                passive="#government.LT"
                to="2024-11-14"/>

opposition is to the government

https://github.com/mindpetk/ParlaMint/blob/5bff893e47533c3e9543d13f1f35a380ef3776d2/Data/ParlaMint-LT/ParlaMint-LT-listOrg.xml#L1525-L1529

      <relation ana="#S.2"
                from="1992-11-24"
                mutual="#parliamentaryGroup.JSF.19922 #parliamentaryGroup.KDF.19923 #parliamentaryGroup.LTSF.19925 #parliamentaryGroup.LLSF.19927 #parliamentaryGroup.LSDPF.19928 #parliamentaryGroup.MSNG.19929 #parliamentaryGroup.NF.19931 #parliamentaryGroup.PCF.19932 #parliamentaryGroup.PKTLF.19933 #parliamentaryGroup.SCF.19934 #parliamentaryGroup.SF.19935 #parliamentaryGroup.SSF.19936 #parliamentaryGroup.TPF.19937 #parliamentaryGroup.TSKF.19938"
                name="opposition"
                to="1996-11-22"/>

should be:

      <relation ana="#S.2"
                from="1992-11-24"
                passive="#government.LT"
                active="#parliamentaryGroup.JSF.19922 #parliamentaryGroup.KDF.19923 #parliamentaryGroup.LTSF.19925 #parliamentaryGroup.LLSF.19927 #parliamentaryGroup.LSDPF.19928 #parliamentaryGroup.MSNG.19929 #parliamentaryGroup.NF.19931 #parliamentaryGroup.PCF.19932 #parliamentaryGroup.PKTLF.19933 #parliamentaryGroup.SCF.19934 #parliamentaryGroup.SF.19935 #parliamentaryGroup.SSF.19936 #parliamentaryGroup.TPF.19937 #parliamentaryGroup.TSKF.19938"
                name="opposition"
                to="1996-11-22"/>

note: there are multiple occurrences of this bug
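Since the bug occurs in several places, the rewrite could be scripted. This is a rough sketch only: `fix_opposition` is a hypothetical helper, and `#government.LT` is taken from the listOrg excerpt quoted in this issue.

```python
import xml.etree.ElementTree as ET


def fix_opposition(relation, government_ref="#government.LT"):
    """Turn a mutual 'opposition' relation into an active/passive one, in place.

    The groups listed in @mutual become @active, and the government becomes
    @passive. Relations with any other @name are left untouched.
    """
    if relation.get("name") == "opposition" and "mutual" in relation.attrib:
        relation.set("active", relation.attrib.pop("mutual"))
        relation.set("passive", government_ref)
    return relation
```

Coalition relations keep @mutual, so only the opposition entries change.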

affiliations that end in the future

Some affiliations end in the future; I guess @to should be removed in these cases:

      <affiliation ana="#S.9"
                   from="2020-11-13T00:00:00"
                   ref="#S"
                   role="member"
                   to="2024-11-14T00:00:00">
         <roleName xml:lang="lt">Narys</roleName>
         <roleName xml:lang="en">Member</roleName>
      </affiliation>
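To find all such @to values (terms, coalition/opposition relations and affiliations alike), something along these lines could help. It is a sketch under assumptions: `future_to_dates` is an invented name, and the reference date is passed in explicitly rather than taken from the corpus.

```python
import xml.etree.ElementTree as ET
from datetime import date


def future_to_dates(root, today):
    """Return every element whose @to date lies after `today`."""
    hits = []
    for el in root.iter():
        to = el.get("to")
        if not to:
            continue
        try:
            # @to may be a date or a dateTime; the first 10 characters are the date
            end = date.fromisoformat(to[:10])
        except ValueError:
            continue  # e.g. a bare year; skipped in this sketch
        if end > today:
            hits.append(el)
    return hits
```

Whether @to should then be dropped is still a per-case decision, for example depending on whether early elections are possible.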

split multiple names

Multiple names are better split into separate elements:

      <persName>
         <forename>Vilija</forename>
         <surname>Aleknaitė Abramikienė</surname>
      </persName>

should be

      <persName>
         <forename>Vilija</forename>
         <surname>Aleknaitė</surname>
         <surname>Abramikienė</surname>
      </persName>

abbreviated forename

   <person xml:id="VitkevičiusP">
      <persName>
         <forename>P.</forename>
         <surname>Vitkevičius</surname>
      </persName>
      <sex value="M"/>
   </person>

I suggest using:

   <person xml:id="VitkevičiusP">
      <persName>
         <forename full="init">P</forename> <!-- @full +  dot removed -->
         <surname>Vitkevičius</surname>
      </persName>
      <sex value="M"/>
   </person>

@TomazErjavec, do you agree? This is the only corpus that has it, as far as I know. But I think it is good to indicate that P is not a forename (it is only the initial letter).

Another possibility is to reconstruct the forename from the parliamentary proceedings. The speaker is usually mentioned in the preceding chairman's speech. (We used this approach in ParlaMint-UA because there are a lot of guest speakers.)

BTW, isn't he the same person as VitkevičiusPranciškusStanislavas?

use correct dates in subcorpus taxonomy

https://github.com/mindpetk/ParlaMint/blob/5bff893e47533c3e9543d13f1f35a380ef3776d2/Data/ParlaMint-LT/ParlaMint-taxonomy-subcorpus.xml

<?xml version="1.0" encoding="UTF-8"?>
<taxonomy xmlns="http://www.tei-c.org/ns/1.0"
          xml:id="ParlaMint-taxonomy-subcorpus"
          xml:lang="mul">
<!--...-->
   <category xml:id="reference">
<!--...-->
      <catDesc xml:lang="en">
         <term>Reference</term>: reference subcorpus, until 2020-01-28</catDesc> <!-- 2019-10-31 -->
   </category>
   <category xml:id="covid">
<!--...-->
         <term>COVID</term>: COVID subcorpus, from 2020-03-10 onwards</catDesc> <!-- 2019-11-01 -->
   </category>
</taxonomy>

see:

https://github.com/clarin-eric/ParlaMint/blob/031ec3009386a4bfec60bf0e22f653a813ddf98c/Data/ParlaMint-CZ/ParlaMint-taxonomy-subcorpus.xml

bibl URL is referring to wrong source

https://github.com/mindpetk/ParlaMint/blob/5bff893e47533c3e9543d13f1f35a380ef3776d2/Data/ParlaMint-LT/ParlaMint-LT_1993-01-04-seimas-1-1.xml#L64

<idno type="URI">https://e-seimas.lrs.lt/portal/legalAct/lt/TAK/TAIS.462504</idno>

refers to the sitting from 2013-12-17

Some utterances look more like notes

https://github.com/mindpetk/ParlaMint/blob/5bff893e47533c3e9543d13f1f35a380ef3776d2/Data/ParlaMint-LT/ParlaMint-LT_2020-05-21-seimas-8-1.xml#L140

<u ana="#chair" who="#DegutienėIrena" xml:id="ParlaMint-LT_2020-05-21-seimas-8-1.u1970">
          <seg xml:id="ParlaMint-LT_2020-05-21-seimas-8-1.u1970.p2674">Opozicinių Tėvynės sąjungos-Lietuvos krikščionių demokratų frakcijos ir Lietuvos socialdemokratų partijos frakcijos darbotvarkė</seg>
          <seg xml:id="ParlaMint-LT_2020-05-21-seimas-8-1.u1970.p2675">14.02 val.</seg>
          <seg xml:id="ParlaMint-LT_2020-05-21-seimas-8-1.u1970.p2676">Diskusija &amp;quot;Kokias švietimo problemas atvėrė pandemijos krizė?&amp;quot;</seg>
        </u>

or https://github.com/mindpetk/ParlaMint/blob/5bff893e47533c3e9543d13f1f35a380ef3776d2/Data/ParlaMint-LT/ParlaMint-LT_2020-05-21-seimas-8-1.xml#L147-L149

        <u ana="#chair" who="#DegutienėIrena" xml:id="ParlaMint-LT_2020-05-21-seimas-8-1.u1972">
          <seg xml:id="ParlaMint-LT_2020-05-21-seimas-8-1.u1972.p2679">Vilniaus universiteto Ugdymo mokslų instituto profesorės habilituotos daktarės Vilijos Targamadzės kalba</seg>
        </u>

speaker note

speaker notes are missing: https://github.com/mindpetk/ParlaMint/blob/5bff893e47533c3e9543d13f1f35a380ef3776d2/Data/ParlaMint-LT/ParlaMint-LT_2020-05-21-seimas-8-1.xml#L150

        </u>
        <u ana="#regular" who="#TargamadzėVilija" xml:id="ParlaMint-LT_2020-05-21-seimas-8-1.u1973">
          <seg xml:id="ParlaMint-LT_2020-05-21-seimas-8-1.u1973.p2680">Laba diena. Mano kalbėjimas bus apie bendrojo ugdymo mokyklos problemas, nes mes sutarėme tam tikromis kryptimis kalbėti apie problemas, kurios išryškėjo pandemijos metu, bet tai nereiškia, kad nėra padaryta daug gerų ir prasmingų darbų. Tiesiog tema yra kita.</seg>

can be:

        </u>
<note type="speaker">V. TARGAMADZĖ.</note>
        <u ana="#regular" who="#TargamadzėVilija" xml:id="ParlaMint-LT_2020-05-21-seimas-8-1.u1973">
          <seg xml:id="ParlaMint-LT_2020-05-21-seimas-8-1.u1973.p2680">Laba diena. Mano kalbėjimas bus apie bendrojo ugdymo mokyklos problemas, nes mes sutarėme tam tikromis kryptimis kalbėti apie problemas, kurios išryškėjo pandemijos metu, bet tai nereiškia, kad nėra padaryta daug gerų ir prasmingų darbų. Tiesiog tema yra kita.</seg>

trailing and leading notes

Trailing and leading notes should be outside utterances: https://clarin-eric.github.io/ParlaMint/#para-hierarchy-comments

Apart from heads and gaps, transcriber comments are encoded using the <note> element or one of several so called ‘incident’ elements, as explained below. These elements can be placed directly inside <div>, <u>, <seg> or even <s> in the linguistically annotated version. They should be placed as far up the hierarchy as possible, ...

vaidasmo commented 1 year ago

Thank you for the comments! We will work on correcting the mistakes. Just two clarifications:

  1. @TomazErjavec, do you agree? This is the only corpus that has it, as far as I know. But I think it is good to indicate that P is not a forename (it is only the initial letter).

Another possibility is to reconstruct the forename from the parliamentary proceedings. The speaker is usually mentioned in the preceding chairman's speech. (We used this approach in ParlaMint-UA because there are a lot of guest speakers.)

BTW, isn't he the same person as VitkevičiusPranciškusStanislavas?

We agree with your suggestion regarding the marking of abbreviated names. However, identifying whether P.Vitkevičius is the same person as VitkevičiusPranciškusStanislavas would be too difficult, if at all possible, especially in the older debates. Speakers of the Seimas do announce the names of guest speakers, but in the transcripts they are abbreviated.

  2. Trailing and leading notes should be outside utterances: https://clarin-eric.github.io/ParlaMint/#para-hierarchy-comments

Do you have in mind any specific places where this is not implemented correctly, or is this a general remark to keep in mind?

Vaidas

matyaskopp commented 1 year ago

Do you have in mind any specific places where this is not implemented correctly, or is this a general remark to keep in mind?

https://github.com/mindpetk/ParlaMint/blob/5bff893e47533c3e9543d13f1f35a380ef3776d2/Data/ParlaMint-LT/ParlaMint-LT_1996-11-05-seimas-2-1.xml#L469-L474

        <u ana="#chair" who="#JuršėnasČeslovas" xml:id="ParlaMint-LT_2020-05-21-seimas-8-1.u466">
          <seg xml:id="ParlaMint-LT_1996-11-05-seimas-2-1.u466.p538">Ačiū, gerbiamasis pranešėjau. Mielieji kolegos, ar galim bendru sutarimu pritarti pateikimui? Prašau. Ar galim bendru sutarimu? Tada prašau, vienas - už, vienas - prieš. Iš eilės. Kolega B.Rupeika. Ar pritariat pateikimui, ar ne?</seg>
          <vocal type="noise">
            <desc>Balsai salėje</desc>
          </vocal>
        </u>

Also, this note, which was not recognized, should be outside <seg>: https://github.com/mindpetk/ParlaMint/blob/5bff893e47533c3e9543d13f1f35a380ef3776d2/Data/ParlaMint-LT/ParlaMint-LT_1996-11-05-seimas-2-1.xml#L498

<seg xml:id="ParlaMint-LT_1996-11-05-seimas-2-1.u474.p547">Salėje net 96 Seimo nariai. Nepanašu, bet manykim. (Salėje šurmulys)</seg>

should be:

<seg xml:id="ParlaMint-LT_1996-11-05-seimas-2-1.u474.p547">Salėje net 96 Seimo nariai. Nepanašu, bet manykim.</seg>
<vocal type="noise">
  <desc>Salėje šurmulys</desc>
</vocal>

But this note: https://github.com/mindpetk/ParlaMint/blob/5bff893e47533c3e9543d13f1f35a380ef3776d2/Data/ParlaMint-LT/ParlaMint-LT_1996-11-05-seimas-2-1.xml#L523

<seg xml:id="ParlaMint-LT_1996-11-05-seimas-2-1.u478.p555">Aš <!--...-->. Kas už tai... (Balsas salėje) Kitas <!--...--> balsuoti.</seg>

It should be inside because it is in the middle of the paragraph - separated by spaces:

<seg xml:id="ParlaMint-LT_1996-11-05-seimas-2-1.u478.p555">Aš <!--...-->. Kas už tai... <vocal type="noise">
  <desc>Balsas salėje</desc>
</vocal> Kitas <!--...--> balsuoti.</seg>

TomazErjavec commented 1 year ago

I agree with (just) having the initial, also for the encoding, except that I would leave the dot, I don't see a good reason to remove it (and TEI has it this way too). So:

<forename full="init">P.</forename>

vaidasmo commented 1 year ago

@matyaskopp and @TomazErjavec thank you both for clarifications!

matyaskopp commented 1 year ago

@vaidasmo , @mindpetk, can you please update your sample? I will then check if everything is fixed.

mindpetk commented 1 year ago

@vaidasmo , @mindpetk, can you please update your sample? I will then check if everything is fixed.

I've uploaded a new Sample with the fixes. Hopefully, it fixes all the issues.

matyaskopp commented 1 year ago

I am not sure about corpus timespan: https://github.com/clarin-eric/ParlaMint/pull/610/files/5bff893e47533c3e9543d13f1f35a380ef3776d2..f8b4846df8da54451aec6ca0d548400f4964edd4#diff-908dc1331ad5ac255c89e559b957707d63865bc208c2ec0c5b21413477db5bd5R68

<date from="1993-01-04" to="2021-12-23">04.01.1993 - 23.12.2021</date>

You are supposed to deliver data up to mid-2022, but the timeframe in the title, bibl and setting only runs to 2021-12-23. Is this the timeframe of the sample or of the whole corpus?

vaidasmo commented 1 year ago

This needs to be updated as our corpus will span till 2022-12-23. Vaidas

matyaskopp commented 1 year ago

Speakers

matyaskopp commented 1 year ago

Not recognized notes

https://github.com/mindpetk/ParlaMint/blob/f8b4846df8da54451aec6ca0d548400f4964edd4/Data/ParlaMint-LT/ParlaMint-LT_1996-11-05-seimas-2-1.xml#L339

<seg xml:id="ParlaMint-LT_1996-11-05-seimas-2-1.u402.p451">Ar <!-- 
... 
--> Vaišnoras? (Balsai salėje) Aš  <!-- 
... 
-->  tekstu? (Triukšmas salėje) Gerai.  <!-- 
... 
-->  Pronckau... (Balsai salėje) V.Bulovas  <!-- 
... 
--> </seg>

matyaskopp commented 1 year ago

Init forename

I agree with (just) having the initial, also for the encoding, except that I would leave the dot, I don't see a good reason to remove it (and TEI has it this way too). So:

<forename full="init">P.</forename> 

Sorry for not confirming @TomazErjavec's suggestion earlier. He is the top dog, and on second thought I agree with him.
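If it helps, the agreed encoding (dot kept, @full="init" added) could be applied automatically. This is only a sketch: `mark_initials` is an invented helper, and it treats any forename consisting of a single letter with an optional dot as an initial.

```python
import re
import xml.etree.ElementTree as ET

# one letter (any script), optionally followed by a dot, e.g. "P." or "Č"
INITIAL = re.compile(r"^[^\W\d_]\.?$")


def mark_initials(root):
    """Add full="init" to every <forename> that holds only an initial.

    The text itself, including the dot, is left unchanged.
    """
    for forename in root.findall(".//{*}forename"):
        if INITIAL.match((forename.text or "").strip()):
            forename.set("full", "init")
    return root
```

Full forenames such as Vilija are not touched, so the function can be run over the whole listPerson.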

mindpetk commented 1 year ago

Apologies for not thoroughly checking my files. I've pushed a new update that should fix the errors.

matyaskopp commented 1 year ago

@mindpetk Thanks for the quick fixes. A few (hopefully last) notes.

chairman notes

I didn't expect you to invent new notes; I expected you to preserve the ones that are in the text: https://github.com/mindpetk/ParlaMint/blob/352f818080cd9175e7ed0388d664bf26c60c2900/Data/ParlaMint-LT/ParlaMint-LT_2020-05-21-seimas-8-1.xml#L128

        <note type="speaker">I. DEGUTIENĖ.</note>
        <u ana="#chair" who="#DegutienėIrena" xml:id="ParlaMint-LT_2020-05-21-seimas-8-1.u1867">
          <seg xml:id="ParlaMint-LT_2020-05-21-seimas-8-1.u1867.p2523">Gerbiami kolegos, pradedame 2020 m. gegužės 21 d. vakarinį posėdį.

I suggest encoding this in this way:

        <note type="speaker">PIRMININKĖ (I. DEGUTIENĖ, TS-LKDF).</note>
        <u ana="#chair" who="#DegutienėIrena" xml:id="ParlaMint-LT_2020-05-21-seimas-8-1.u1867">
          <seg xml:id="ParlaMint-LT_2020-05-21-seimas-8-1.u1867.p2523">Gerbiami kolegos, pradedame 2020 m. gegužės 21 d. vakarinį posėdį.

incidents

It is better to preserve spaces around incidents (I am not sure whether your tokenization tool segments sentences correctly when a newline falls inside a sentence): https://github.com/mindpetk/ParlaMint/blob/352f818080cd9175e7ed0388d664bf26c60c2900/Data/ParlaMint-LT/ParlaMint-LT_2020-05-21-seimas-8-1.xml#L130-L132

          <seg xml:id="ParlaMint-LT_2020-05-21-seimas-8-1.u1867.p2523">Gerbiami kolegos, pradedame 2020 m. gegužės 21 d. vakarinį posėdį.
            <vocal type="noise">
              <desc xml:lang="lt">Gongas</desc></vocal>Registruojamės.</seg>

better use:

<seg xml:id="ParlaMint-LT_2020-05-21-seimas-8-1.u1867.p2523">Gerbiami kolegos, pradedame 2020 m. gegužės 21 d. vakarinį posėdį. <vocal type="noise">
    <desc xml:lang="lt">Gongas</desc>
</vocal> Registruojamės.</seg>

(notes placement also discussed here: https://github.com/clarin-eric/ParlaMint/issues/621#issuecomment-1476852833)

session number

https://github.com/mindpetk/ParlaMint/blob/352f818080cd9175e7ed0388d664bf26c60c2900/Data/ParlaMint-LT/ParlaMint-LT_2020-05-21-seimas-8-1.xml#L11

        <meeting ana="#parla.uni #parla.term #S.8" corresp="#S" n="8">8 kadencija</meeting>
        <meeting ana="#parla.uni #parla.session #S.8" corresp="#S" n="1"> 8 eilinė sesija </meeting>
        <meeting ana="#parla.uni #parla.meeting.regular" n="1">1 posėdis</meeting>

should be

        <meeting ana="#parla.uni #parla.term #S.8" corresp="#S" n="8">8 kadencija</meeting>
<!-- remove event that correspond to term + fix @n value: -->
        <meeting ana="#parla.uni #parla.session" corresp="#S" n="8"> 8 eilinė sesija </meeting>
        <meeting ana="#parla.uni #parla.meeting.regular" n="1">1 posėdis</meeting>

Linguistic annotation

You are feeding parts of the XML markup into the linguistic annotation: https://github.com/mindpetk/ParlaMint/blob/352f818080cd9175e7ed0388d664bf26c60c2900/Data/ParlaMint-LT/ParlaMint-LT_1993-01-04-seimas-1-1.xml#L394-L396

          <seg xml:id="ParlaMint-LT_1993-01-04-seimas-1-1.u64.p74">Taigi 75 Seimo nariams balsavus už, 3 balsavus prieš ir 16 susilaikius, Seimo nutarimas &amp;quot;Dėl Lietuvos Respublikos Valstybės kontrolieriaus&amp;quot; priimtas.
            <vocal type="noise">
              <desc xml:lang="lt">Plojimai</desc></vocal>Prisijungiu prie plojimų ir dar sykį sveikinu Vidą Kundrotą, jau kaip Lietuvos Respublikos Valstybės kontrolierių. Sėkmingo darbo. Ačiū.</seg>

after linguistic annotations (removed attributes, preserving tokens):

<seg xml:id="ParlaMint-LT_1993-01-04-seimas-1-1.u64.p74">
  <s xml:id="ParlaMint-LT_1993-01-04-seimas-1-1.seg74.1">
    <w>Taigi</w>
    <w>75</w>
    <name type="ORG">
      <w>Seimo</w>
    </name>
    <w>nariams</w>
    <w>balsavus</w>
    <w>už</w>
    <pc>,</pc>
    <w>3</w>
    <w>balsavus</w>
    <w>prieš</w>
    <w>ir</w>
    <w>16</w>
    <w>susilaikius</w>
    <pc>,</pc>
    <name type="ORG">
      <w>Seimo</w>
    </name>
    <w>nutarimas</w>
    <w>&amp;amp;</w>
    <w>amp;quot;Dėl</w>
    <name type="MISC">
      <w>Lietuvos</w>
      <w>Respublikos</w>
    </name>
    <w>Valstybės</w>
    <w>kontrolieriaus&amp;amp;amp;quot</w>
    <pc>;</pc>
    <w>priimtas.&amp;lt;vocal</w>
    <w>type=&amp;quot;noise&amp;quot;&amp;gt;&amp;lt;desc</w>
    <pc>xml</pc>
    <pc>:</pc>
    <w>lang=&amp;quot;lt&amp;quot;&amp;gt;Plojimai&amp;lt;/desc&amp;gt;&amp;lt;/vocal</w>
    <pc>&amp;gt;</pc>
    <w>Prisijungiu</w>
    <w>prie</w>
    <w>plojimų</w>
    <w>ir</w>
    <w>dar</w>
    <w>sykį</w>
    <w>sveikinu</w>
    <name type="PER">
      <w>Vidą</w>
      <w>Kundrotą</w>
    </name>
    <linkGrp targFunc="head argument" type="UD-SYN"> <!-- ... --> </linkGrp>
  </s>
  <s xml:id="ParlaMint-LT_1993-01-04-seimas-1-1.seg74.2"> <!-- ... --> </s>
  <s xml:id="ParlaMint-LT_1993-01-04-seimas-1-1.seg74.3"> <!-- ... --> </s>
  <s xml:id="ParlaMint-LT_1993-01-04-seimas-1-1.seg74.4"> <!-- ... --> </s>
</seg>

matyaskopp commented 1 year ago

double xml entity escaping &amp;quot;

https://github.com/mindpetk/ParlaMint/blob/352f818080cd9175e7ed0388d664bf26c60c2900/Data/ParlaMint-LT/ParlaMint-LT_1993-01-04-seimas-1-1.xml#L134

<seg xml:id="ParlaMint-LT_1993-01-04-seimas-1-1.u2.p5">Suprasdamas <!--... --> tvarkos.&amp;quot; Pasirašo A.Endriukaitis.</seg>

should be

<seg xml:id="ParlaMint-LT_1993-01-04-seimas-1-1.u2.p5">Suprasdamas <!--... --> tvarkos.&quot; Pasirašo A.Endriukaitis.</seg>

or not to escape it inside text at all (easiest/safest way)

<seg xml:id="ParlaMint-LT_1993-01-04-seimas-1-1.u2.p5">Suprasdamas <!--... --> tvarkos." Pasirašo A.Endriukaitis.</seg>
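One way to repair this before annotation is to unescape the parsed text until no entity references are left. This is a sketch rather than a tested fix for the corpus: `unescape_fully` and `clean_element` are invented names, and only the five predefined XML entities are handled.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import unescape

# unescape() handles &lt; &gt; &amp; by default; quotes have to be listed
XML_ENTITIES = {"&quot;": '"', "&apos;": "'"}


def unescape_fully(text):
    """Undo leftover entity escaping until the text stops changing."""
    while True:
        cleaned = unescape(text, XML_ENTITIES)
        if cleaned == text:
            return cleaned
        text = cleaned


def clean_element(el):
    """Apply unescape_fully to all text and tail strings under `el`."""
    for node in el.iter():
        if node.text:
            node.text = unescape_fully(node.text)
        if node.tail:
            node.tail = unescape_fully(node.tail)
    return el
```

After parsing, a double-escaped quote shows up in the text as a literal `&quot;`, which is what the loop removes. On serialization the quote is written as a plain `"`, which matches the "not escaped at all" variant above.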

note that it breaks linguistic annotation:

<w lemma="tvarkos.&amp;amp;amp;quot" 
   msd="UPosTag=NOUN|Case=Gen|Gender=Masc|Number=Sing"
   xml:id="ParlaMint-LT_1993-01-04-seimas-1-1.seg5.1.24">tvarkos.&amp;amp;amp;quot</w>
<pc msd="UPosTag=PUNCT" xml:id="ParlaMint-LT_1993-01-04-seimas-1-1.seg5.1.25">;</pc>

mindpetk commented 1 year ago

I've uploaded an updated version of the Sample.

incidents

It is better to preserve spaces around incidents (not sure if your tokenization tool does correctly sentence...

The XML parser keeps putting <vocal type="noise"> on a new line, thus removing the last space. Other than that, everything else about the new sample should be fixed.

matyaskopp commented 1 year ago

@mindpetk sorry for the delay...

I don't know what tool you are using. In XSLT:

<xsl:preserve-space elements="s seg catDesc"/>

In Perl package XML::LibXML::PrettyPrint:

  use XML::LibXML::PrettyPrint;

  # keep whitespace inside these mixed-content elements untouched
  my $pp = XML::LibXML::PrettyPrint->new(
      element => {
          preserves_whitespace => [qw/s seg catDesc/],
      }
  );

Other tools will be similar - search for preserve in the documentation.

matyaskopp commented 1 year ago

@mindpetk there are still notes that should be placed outside elements.

When s or seg or u start/end with note/incident, then note/incident should be moved to the parent element (bubble up in ancestor axis).

I am thinking about implementing a one-purpose script that solves it because this is quite a common mistake...
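To make the rule concrete, here is a rough Python sketch of the bubbling-up. It is not the planned ParlaMint script: `bubble_up` and the element sets are assumptions for illustration, and the input is assumed to be parsed with ElementTree's default parser, which drops XML comments.

```python
import xml.etree.ElementTree as ET

CONTAINERS = {"s", "seg", "u"}
MOVABLE = {"note", "vocal", "kinesic", "incident"}  # assumed set; gap is not moved


def local(tag):
    """Local element name without the namespace part."""
    return tag.rsplit("}", 1)[-1] if isinstance(tag, str) else ""


def blank(text):
    return text is None or not text.strip()


def bubble_up(root):
    """Move leading/trailing notes and incidents out of s, seg and u."""
    parent_of = {child: parent for parent in root.iter() for child in parent}
    containers = [el for el in root.iter() if local(el.tag) in CONTAINERS]
    # reversed document order visits descendants before their ancestors,
    # so a note can climb from seg to u to div in one pass
    for el in reversed(containers):
        parent = parent_of.get(el)
        if parent is None:
            continue
        # leading: no text before the first child
        while len(el) and local(el[0].tag) in MOVABLE and blank(el.text):
            child = el[0]
            el.remove(child)
            el.text = child.tail  # text after the note stays in the container
            child.tail = None
            parent.insert(list(parent).index(el), child)
        # trailing: no text after the last child
        while len(el) and local(el[-1].tag) in MOVABLE and blank(el[-1].tail):
            child = el[-1]
            el.remove(child)
            child.tail = el.tail
            el.tail = None
            parent.insert(list(parent).index(el) + 1, child)
    return root
```

An incident with text on both sides fails both tests and stays where it is, which matches the "in the middle of the paragraph" case discussed earlier.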

TomazErjavec commented 1 year ago

@mindpetk there are still notes that should be placed outside elements.

Maybe we should try for this in 3.1 but for now just leave it? I'm not sure nobody has them anyway...

When s or seg or u start/end with note/incident, then note/incident should be moved to the parent element (bubble up in ancestor axis).

Nicely put. Interestingly, gap doesn't do this.

I am thinking about implementing a one-purpose script that solves it because this is quite a common mistake...

That would be of course great. And it could be included in the finalize script.