Closed matyaskopp closed 1 year ago
Thank you for comments! We will work on correcting mistakes. Just two clarifications:
@TomazErjavec, do you agree? This is the only corpus that has it, as far as I know. But I think it is good to indicate that P is not a forename (it is only the initial letter).
Another possibility is to reconstruct the forename from parliamentary proceedings. The speaker is usually mentioned in the preceding chairman's speech. (We used this approach in ParlaMint-UA because there are a lot of guest speakers.)
BTW, isn't he the same person as VitkevičiusPranciškusStanislavas?
We agree with your suggestion regarding the marking of abbreviated names. However, identifying whether P.Vitkevičius is the same person as VitkevičiusPranciškusStanislavas would be too difficult, if possible at all, especially in the older debates. Speakers of the Seimas do announce the names of guest speakers, but in the transcripts they are abbreviated.
Trailing and leading notes should be outside utterances: https://clarin-eric.github.io/ParlaMint/#para-hierarchy-comments
Do you have any specific places in mind where this is not implemented correctly, or is this a general remark to keep in mind?
Vaidas
<u ana="#chair" who="#JuršėnasČeslovas" xml:id="ParlaMint-LT_2020-05-21-seimas-8-1.u466">
<seg xml:id="ParlaMint-LT_1996-11-05-seimas-2-1.u466.p538">Ačiū, gerbiamasis pranešėjau. Mielieji kolegos, ar galim bendru sutarimu pritarti pateikimui? Prašau. Ar galim bendru sutarimu? Tada prašau, vienas - už, vienas - prieš. Iš eilės. Kolega B.Rupeika. Ar pritariat pateikimui, ar ne?</seg>
<vocal type="noise">
<desc>Balsai salėje</desc>
</vocal>
</u>
Also, this note, which was not recognized, should be outside <seg>:
https://github.com/mindpetk/ParlaMint/blob/5bff893e47533c3e9543d13f1f35a380ef3776d2/Data/ParlaMint-LT/ParlaMint-LT_1996-11-05-seimas-2-1.xml#L498
<seg xml:id="ParlaMint-LT_1996-11-05-seimas-2-1.u474.p547">Salėje net 96 Seimo nariai. Nepanašu, bet manykim. (Salėje šurmulys)</seg>
should be:
<seg xml:id="ParlaMint-LT_1996-11-05-seimas-2-1.u474.p547">Salėje net 96 Seimo nariai. Nepanašu, bet manykim.</seg>
<vocal type="noise">
<desc>Salėje šurmulys</desc>
</vocal>
<seg xml:id="ParlaMint-LT_1996-11-05-seimas-2-1.u478.p555">Aš <!--...-->. Kas už tai... (Balsas salėje) Kitas <!--...--> balsuoti.</seg>
It should be inside because it is in the middle of the paragraph - separated by spaces:
<seg xml:id="ParlaMint-LT_1996-11-05-seimas-2-1.u478.p555">Aš <!--...-->. Kas už tai... <vocal type="noise">
<desc>Balsas salėje</desc>
</vocal> Kitas <!--...--> balsuoti.</seg>
I agree with (just) having the initial, also for the encoding, except that I would leave the dot; I don't see a good reason to remove it (and TEI has it this way too). So:
<forename full="init">P.</forename>
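If it helps, here is a minimal Python sketch of how such initials could be flagged automatically. It assumes un-namespaced elements (real ParlaMint files use the TEI namespace) and a hypothetical helper name, mark_initial_forenames; the rule "one uppercase letter plus a dot" is my assumption about what counts as an initial.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: an initial is exactly one letter followed by a dot, e.g. "P."
INITIAL = re.compile(r"^[^\W\d_]\.$")


def mark_initial_forenames(pers_name):
    """Set full="init" on every <forename> that holds only an initial."""
    for forename in pers_name.iter("forename"):
        text = (forename.text or "").strip()
        if INITIAL.match(text) and text[0].isupper():
            forename.set("full", "init")
    return pers_name


pers = ET.fromstring(
    "<persName><surname>Vitkevičius</surname><forename>P.</forename></persName>"
)
mark_initial_forenames(pers)
print(ET.tostring(pers, encoding="unicode"))
```

The dot stays in the element text, as suggested above; only the attribute is added.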
@matyaskopp and @TomazErjavec thank you both for clarifications!
@vaidasmo , @mindpetk, can you please update your sample? I will then check if everything is fixed.
I've uploaded a new Sample with the fixes. Hopefully, it fixes all the issues.
I am not sure about corpus timespan: https://github.com/clarin-eric/ParlaMint/pull/610/files/5bff893e47533c3e9543d13f1f35a380ef3776d2..f8b4846df8da54451aec6ca0d548400f4964edd4#diff-908dc1331ad5ac255c89e559b957707d63865bc208c2ec0c5b21413477db5bd5R68
<date from="1993-01-04" to="2021-12-23">04.01.1993 - 23.12.2021</date>
You are supposed to deliver data up to mid-2022, but the timeframe in title, bibl, and setting runs only up to 2021-12-23.
Is this the sample timeframe or the timeframe of the whole corpus?
This needs to be updated, as our corpus will span until 2022-12-23. Vaidas
<seg xml:id="ParlaMint-LT_1996-11-05-seimas-2-1.u402.p451">Ar <!--...--> Vaišnoras? (Balsai salėje) Aš <!--...--> tekstu? (Triukšmas salėje) Gerai. <!--...--> Pronckau... (Balsai salėje) V.Bulovas <!--...--> </seg>
Sorry for not confirming @TomazErjavec's suggestion. He is the top dog, and on second thought I agree with him.
Apologies for not thoroughly checking my files. I've pushed a new update that should fix the errors.
@mindpetk Thanks for the quick fixes. A few (hopefully last) notes.
I didn't expect you to invent new notes; I expect you to preserve the ones that are in the text: https://github.com/mindpetk/ParlaMint/blob/352f818080cd9175e7ed0388d664bf26c60c2900/Data/ParlaMint-LT/ParlaMint-LT_2020-05-21-seimas-8-1.xml#L128
<note type="speaker">I. DEGUTIENĖ.</note>
<u ana="#chair" who="#DegutienėIrena" xml:id="ParlaMint-LT_2020-05-21-seimas-8-1.u1867">
<seg xml:id="ParlaMint-LT_2020-05-21-seimas-8-1.u1867.p2523">Gerbiami kolegos, pradedame 2020 m. gegužės 21 d. vakarinį posėdį.
I suggest encoding it this way:
<note type="speaker">PIRMININKĖ (I. DEGUTIENĖ, TS-LKDF).</note>
<u ana="#chair" who="#DegutienėIrena" xml:id="ParlaMint-LT_2020-05-21-seimas-8-1.u1867">
<seg xml:id="ParlaMint-LT_2020-05-21-seimas-8-1.u1867.p2523">Gerbiami kolegos, pradedame 2020 m. gegužės 21 d. vakarinį posėdį.
It is better to preserve spaces around incidents (I am not sure whether your tokenization tool segments sentences correctly when a newline occurs inside a sentence): https://github.com/mindpetk/ParlaMint/blob/352f818080cd9175e7ed0388d664bf26c60c2900/Data/ParlaMint-LT/ParlaMint-LT_2020-05-21-seimas-8-1.xml#L130-L132
<seg xml:id="ParlaMint-LT_2020-05-21-seimas-8-1.u1867.p2523">Gerbiami kolegos, pradedame 2020 m. gegužės 21 d. vakarinį posėdį.
<vocal type="noise">
<desc xml:lang="lt">Gongas</desc></vocal>Registruojamės.</seg>
Better to use:
<seg xml:id="ParlaMint-LT_2020-05-21-seimas-8-1.u1867.p2523">Gerbiami kolegos, pradedame 2020 m. gegužės 21 d. vakarinį posėdį. <vocal type="noise">
<desc xml:lang="lt">Gongas</desc>
</vocal> Registruojamės.</seg>
(notes placement also discussed here: https://github.com/clarin-eric/ParlaMint/issues/621#issuecomment-1476852833)
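To make the difference concrete, here is a small Python sketch of what a tool might hand to the tokenizer if it simply skips incident elements. This is an illustration only: I do not know how your pipeline extracts text, the helper name spoken_text and the INCIDENTS set are assumptions, and the elements are un-namespaced for brevity.

```python
import xml.etree.ElementTree as ET

# Assumption: these element names are treated as incidents and skipped.
INCIDENTS = {"vocal", "kinesic", "incident", "note", "gap"}


def spoken_text(seg):
    """Concatenate the speech inside a <seg>, skipping incident elements."""
    parts = [seg.text or ""]
    for child in seg:
        if child.tag not in INCIDENTS:
            parts.append(spoken_text(child))
        parts.append(child.tail or "")  # text that follows the child
    return "".join(parts)


original = ET.fromstring(
    "<seg>vakarinį posėdį.\n"
    '<vocal type="noise">\n<desc>Gongas</desc></vocal>Registruojamės.</seg>'
)
suggested = ET.fromstring(
    '<seg>vakarinį posėdį. <vocal type="noise">\n'
    "<desc>Gongas</desc>\n</vocal> Registruojamės.</seg>"
)

print(repr(spoken_text(original)))
print(repr(spoken_text(suggested)))
```

In the first form the only separator between the two sentences is a newline; in the second form they are separated by ordinary spaces, which is the safer input for sentence segmentation.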
<meeting ana="#parla.uni #parla.term #S.8" corresp="#S" n="8">8 kadencija</meeting>
<meeting ana="#parla.uni #parla.session #S.8" corresp="#S" n="1"> 8 eilinė sesija </meeting>
<meeting ana="#parla.uni #parla.meeting.regular" n="1">1 posėdis</meeting>
should be
<meeting ana="#parla.uni #parla.term #S.8" corresp="#S" n="8">8 kadencija</meeting>
<!-- remove the event that corresponds to the term + fix the @n value: -->
<meeting ana="#parla.uni #parla.session" corresp="#S" n="8"> 8 eilinė sesija </meeting>
<meeting ana="#parla.uni #parla.meeting.regular" n="1">1 posėdis</meeting>
You are processing parts of the XML markup as text during linguistic annotation: https://github.com/mindpetk/ParlaMint/blob/352f818080cd9175e7ed0388d664bf26c60c2900/Data/ParlaMint-LT/ParlaMint-LT_1993-01-04-seimas-1-1.xml#L394-L396
<seg xml:id="ParlaMint-LT_1993-01-04-seimas-1-1.u64.p74">Taigi 75 Seimo nariams balsavus už, 3 balsavus prieš ir 16 susilaikius, Seimo nutarimas &quot;Dėl Lietuvos Respublikos Valstybės kontrolieriaus&quot; priimtas.
<vocal type="noise">
<desc xml:lang="lt">Plojimai</desc></vocal>Prisijungiu prie plojimų ir dar sykį sveikinu Vidą Kundrotą, jau kaip Lietuvos Respublikos Valstybės kontrolierių. Sėkmingo darbo. Ačiū.</seg>
After linguistic annotation (attributes removed, tokens preserved):
<seg xml:id="ParlaMint-LT_1993-01-04-seimas-1-1.u64.p74">
<s xml:id="ParlaMint-LT_1993-01-04-seimas-1-1.seg74.1">
<w>Taigi</w>
<w>75</w>
<name type="ORG">
<w>Seimo</w>
</name>
<w>nariams</w>
<w>balsavus</w>
<w>už</w>
<pc>,</pc>
<w>3</w>
<w>balsavus</w>
<w>prieš</w>
<w>ir</w>
<w>16</w>
<w>susilaikius</w>
<pc>,</pc>
<name type="ORG">
<w>Seimo</w>
</name>
<w>nutarimas</w>
<w>&amp;</w>
<w>amp;quot;Dėl</w>
<name type="MISC">
<w>Lietuvos</w>
<w>Respublikos</w>
</name>
<w>Valstybės</w>
<w>kontrolieriaus&amp;amp;quot</w>
<pc>;</pc>
<w>priimtas.&lt;vocal</w>
<w>type=&quot;noise&quot;&gt;&lt;desc</w>
<pc>xml</pc>
<pc>:</pc>
<w>lang=&quot;lt&quot;&gt;Plojimai&lt;/desc&gt;&lt;/vocal</w>
<pc>&gt;</pc>
<w>Prisijungiu</w>
<w>prie</w>
<w>plojimų</w>
<w>ir</w>
<w>dar</w>
<w>sykį</w>
<w>sveikinu</w>
<name type="PER">
<w>Vidą</w>
<w>Kundrotą</w>
</name>
<linkGrp targFunc="head argument" type="UD-SYN"> <!-- ... --> </linkGrp>
</s>
<s xml:id="ParlaMint-LT_1993-01-04-seimas-1-1.seg74.2"> <!-- ... --> </s>
<s xml:id="ParlaMint-LT_1993-01-04-seimas-1-1.seg74.3"> <!-- ... --> </s>
<s xml:id="ParlaMint-LT_1993-01-04-seimas-1-1.seg74.4"> <!-- ... --> </s>
</seg>
&quot;
<seg xml:id="ParlaMint-LT_1993-01-04-seimas-1-1.u2.p5">Suprasdamas <!--... --> tvarkos.&amp;quot; Pasirašo A.Endriukaitis.</seg>
should be
<seg xml:id="ParlaMint-LT_1993-01-04-seimas-1-1.u2.p5">Suprasdamas <!--... --> tvarkos.&quot; Pasirašo A.Endriukaitis.</seg>
or do not escape it inside text at all (the easiest and safest way):
<seg xml:id="ParlaMint-LT_1993-01-04-seimas-1-1.u2.p5">Suprasdamas <!--... --> tvarkos." Pasirašo A.Endriukaitis.</seg>
Note that it breaks the linguistic annotation:
<w lemma="tvarkos.&amp;amp;quot"
msd="UPosTag=NOUN|Case=Gen|Gender=Masc|Number=Sing"
xml:id="ParlaMint-LT_1993-01-04-seimas-1-1.seg5.1.24">tvarkos.&amp;amp;quot</w>
<pc msd="UPosTag=PUNCT" xml:id="ParlaMint-LT_1993-01-04-seimas-1-1.seg5.1.25">;</pc>
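Both problems (markup and entities ending up inside tokens) go away if the text is taken from a parsed tree instead of from the serialized string. A minimal Python sketch, assuming un-namespaced elements and a hypothetical helper text_runs; I am not claiming this is how your pipeline should be structured, only showing the principle:

```python
import xml.etree.ElementTree as ET

raw = (
    "<seg>Seimo nutarimas &quot;Dėl kontrolieriaus&quot; priimtas. "
    '<vocal type="noise"><desc xml:lang="lt">Plojimai</desc></vocal> '
    "Prisijungiu prie plojimų.</seg>"
)


def text_runs(seg):
    """Yield the text runs of the speech itself.

    The parser has already decoded entities, and child elements
    (incidents) are skipped, so no markup can reach the tokenizer.
    """
    if seg.text:
        yield seg.text
    for child in seg:
        if child.tail:
            yield child.tail


runs = list(text_runs(ET.fromstring(raw)))
for run in runs:
    print(repr(run))
```

Each run can then be annotated separately, and the incident elements are put back between the annotated runs when the output is written.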
I've uploaded an updated version of the Sample.
The XML parser keeps putting <vocal type="noise"> on a new line, thus removing the last space.
Other than that, everything else about the new sample should be fixed.
@mindpetk sorry for the delay...
I don't know what tool you are using. In XSLT:
<xsl:preserve-space elements="s seg catDesc"/>
In Perl package XML::LibXML::PrettyPrint:
my $pp = XML::LibXML::PrettyPrint->new(
    element => {
        preserves_whitespace => [qw/s seg catDesc/],
    }
);
Other tools will be similar - search for "preserve" in the documentation.
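If the tool is written in Python, the same idea can be sketched with a hand-rolled indentation helper. This is not a library feature: indent_preserving and the PRESERVE set are names I made up for the illustration, and the elements are un-namespaced for brevity.

```python
import xml.etree.ElementTree as ET

# Elements whose whitespace must not be touched (same list as above).
PRESERVE = {"s", "seg", "catDesc"}


def is_blank(text):
    return not (text or "").strip()


def indent_preserving(elem, level=0, step="  "):
    """Indent element-only content, leaving PRESERVE elements untouched."""
    if elem.tag in PRESERVE or len(elem) == 0:
        return
    inner = "\n" + step * (level + 1)
    if is_blank(elem.text):
        elem.text = inner
    for child in elem:
        indent_preserving(child, level + 1, step)
        if is_blank(child.tail):
            child.tail = inner
    # The last child is followed by the parent's closing tag.
    if is_blank(elem[-1].tail):
        elem[-1].tail = "\n" + step * level


u = ET.fromstring(
    '<u><seg>posėdį. <vocal type="noise"><desc>Gongas</desc></vocal>'
    " Registruojamės.</seg></u>"
)
indent_preserving(u)
print(ET.tostring(u, encoding="unicode"))
```

The helper never descends into seg, so the spaces around vocal survive the pretty-printing.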
@mindpetk there are still notes that should be placed outside elements.
When s, seg, or u starts or ends with a note/incident, the note/incident should be moved to the parent element (bubble up along the ancestor axis).
I am thinking about implementing a one-purpose script that solves it because this is quite a common mistake...
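A rough Python sketch of what such a script could do, to make the rule concrete. Assumptions: un-namespaced elements, the hypothetical names bubble_up and post_order, and an INCIDENTS set that I picked for the illustration. An incident counts as leading or trailing only if nothing but whitespace separates it from the start or end of its container.

```python
import xml.etree.ElementTree as ET

CONTAINERS = {"s", "seg", "u"}
# Assumption: which elements count as notes/incidents.
INCIDENTS = {"note", "vocal", "kinesic", "incident"}


def blank(text):
    return not (text or "").strip()


def post_order(elem):
    """Visit children before their parent."""
    for child in list(elem):
        yield from post_order(child)
    yield elem


def bubble_up(root):
    """Move leading/trailing incidents out of s, seg and u into the parent."""
    parent_of = {child: parent for parent in root.iter() for child in parent}
    # Post-order: an incident lifted out of <seg> can be lifted out of <u>
    # later in the same pass.
    for elem in post_order(root):
        parent = parent_of.get(elem)
        if elem.tag not in CONTAINERS or parent is None:
            continue
        while len(elem) and elem[0].tag in INCIDENTS and blank(elem.text):
            child = elem[0]
            elem.remove(child)
            elem.text, child.tail = child.tail, None
            parent.insert(list(parent).index(elem), child)
            parent_of[child] = parent
        while len(elem) and elem[-1].tag in INCIDENTS and blank(elem[-1].tail):
            child = elem[-1]
            elem.remove(child)
            child.tail, elem.tail = elem.tail, None
            parent.insert(list(parent).index(elem) + 1, child)
            parent_of[child] = parent
    return root


doc = ET.fromstring(
    '<div><u who="#JuršėnasČeslovas"><seg>Ar pritariat pateikimui, ar ne?'
    '<vocal type="noise"><desc>Balsai salėje</desc></vocal></seg></u></div>'
)
bubble_up(doc)
print(ET.tostring(doc, encoding="unicode"))
```

Incidents in the middle of a paragraph are left alone because text follows them. Note that a container holding nothing but incidents ends up empty, so a real script would probably want to drop it afterwards.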
@mindpetk there are still notes that should be placed outside elements.
Maybe we should try for this in 3.1 but for now just leave it? I'm not sure nobody has them anyway...
When s, seg, or u starts or ends with a note/incident, the note/incident should be moved to the parent element (bubble up along the ancestor axis).
Nicely put. Interestingly, gap doesn't do this.
I am thinking about implementing a one-purpose script that solves it because this is quite a common mistake...
That would of course be great. And it could be included in the finalize script.
Unique main title
Title should be unique in the corpus https://clarin-eric.github.io/ParlaMint/#exa-titleStmtComp
https://github.com/mindpetk/ParlaMint/blob/5bff893e47533c3e9543d13f1f35a380ef3776d2/Data/ParlaMint-LT/ParlaMint-LT_1993-01-04-seimas-1-1.xml#L5-L6
wrong corpus timespan
bibl timespan
setting timespan
Corpus timespan in title:
bibl: https://github.com/mindpetk/ParlaMint/blob/5bff893e47533c3e9543d13f1f35a380ef3776d2/Data/ParlaMint-LT/ParlaMint-LT.xml#L66
setting: https://github.com/mindpetk/ParlaMint/blob/5bff893e47533c3e9543d13f1f35a380ef3776d2/Data/ParlaMint-LT/ParlaMint-LT.xml#L127
missing current governments
From the data, it seems that the last government ended on 2020-12-11, and a new one was not established. https://github.com/mindpetk/ParlaMint/blob/5bff893e47533c3e9543d13f1f35a380ef3776d2/Data/ParlaMint-LT/ParlaMint-LT-listOrg.xml#L70-L75
LT has unicameral system #parla.uni
https://github.com/mindpetk/ParlaMint/blob/5bff893e47533c3e9543d13f1f35a380ef3776d2/Data/ParlaMint-LT/ParlaMint-LT-listOrg.xml#L76
should be
I believe
@to date in current term
Is it possible to have an early election in Lithuania? If yes, then the @to attribute should be removed: https://github.com/mindpetk/ParlaMint/blob/5bff893e47533c3e9543d13f1f35a380ef3776d2/Data/ParlaMint-LT/ParlaMint-LT-listOrg.xml#L120
Otherwise you can leave it as it is.
@to date in coalition/opposition
I suggest removing the @to date in the current coalition and opposition, because there can be changes in the future: https://github.com/mindpetk/ParlaMint/blob/5bff893e47533c3e9543d13f1f35a380ef3776d2/Data/ParlaMint-LT/ParlaMint-LT-listOrg.xml#L1747-L1757
opposition is to the government
https://github.com/mindpetk/ParlaMint/blob/5bff893e47533c3e9543d13f1f35a380ef3776d2/Data/ParlaMint-LT/ParlaMint-LT-listOrg.xml#L1525-L1529
should be:
note: there are multiple occurrences of this bug
affiliations that end in the future
@to date in affiliation
Some affiliations end in the future, and I guess @to should be removed in these cases.
split multiple names
Multiple names are better split into multiple elements:
should be
abbreviated forename
I suggest using:
@TomazErjavec, do you agree? This is the only corpus that has it, as far as I know. But I think it is good to indicate that P is not a forename (it is only the initial letter).
Another possibility is to reconstruct the forename from parliamentary proceedings. The speaker is usually mentioned in the preceding chairman's speech. (We used this approach in ParlaMint-UA because there are a lot of guest speakers.)
BTW, isn't he the same person as VitkevičiusPranciškusStanislavas?
use correct dates in subcorpus taxonomy
https://github.com/mindpetk/ParlaMint/blob/5bff893e47533c3e9543d13f1f35a380ef3776d2/Data/ParlaMint-LT/ParlaMint-taxonomy-subcorpus.xml
see:
https://github.com/clarin-eric/ParlaMint/blob/031ec3009386a4bfec60bf0e22f653a813ddf98c/Data/ParlaMint-CZ/ParlaMint-taxonomy-subcorpus.xml
bibl URL is referring to wrong source
https://github.com/mindpetk/ParlaMint/blob/5bff893e47533c3e9543d13f1f35a380ef3776d2/Data/ParlaMint-LT/ParlaMint-LT_1993-01-04-seimas-1-1.xml#L64
refers to sitting from 2013-12-17
Some utterances look more like notes
https://github.com/mindpetk/ParlaMint/blob/5bff893e47533c3e9543d13f1f35a380ef3776d2/Data/ParlaMint-LT/ParlaMint-LT_2020-05-21-seimas-8-1.xml#L140
or https://github.com/mindpetk/ParlaMint/blob/5bff893e47533c3e9543d13f1f35a380ef3776d2/Data/ParlaMint-LT/ParlaMint-LT_2020-05-21-seimas-8-1.xml#L147-L149
speaker note
speaker notes are missing: https://github.com/mindpetk/ParlaMint/blob/5bff893e47533c3e9543d13f1f35a380ef3776d2/Data/ParlaMint-LT/ParlaMint-LT_2020-05-21-seimas-8-1.xml#L150
can be:
trailing and leading notes
Trailing and leading notes should be outside utterances: https://clarin-eric.github.io/ParlaMint/#para-hierarchy-comments