clarin-eric / ParlaMint

ParlaMint: Comparable Parliamentary Corpora
https://clarin-eric.github.io/ParlaMint/

SE feedback #436

Closed matyaskopp closed 1 year ago

matyaskopp commented 1 year ago

component filenames

Can you please rename the component files according to the recommendations in 2.3. File names and directory structure?

wrong meeting text content

https://github.com/ninpnin/ParlaMint/blob/6b36bd4952aecde322bbd4c52ad4eed7a73e6ec3/Data/ParlaMint-SE/ParlaMint-SE.xml#L8-L10

        <meeting n="2014-2018" ana="#parla.uni #parla.term">Mandatperioden 2014–2018</meeting>
        <meeting n="2018-2022" ana="#parla.uni #parla.term">Mandatperioden 2014–2018</meeting>
        <meeting n="2022-2026" ana="#parla.uni #parla.term">Mandatperioden 2014–2018</meeting>

missing Swedish translations in taxonomies

remove unused taxonomies

I guess you can remove this taxonomy. It was used in the CZ corpus and it seems that you don't use it: https://github.com/ninpnin/ParlaMint/blob/6b36bd4952aecde322bbd4c52ad4eed7a73e6ec3/Data/ParlaMint-SE/ParlaMint-SE.xml#L265-L279

        <taxonomy xml:id="parla.links">
          <desc xml:lang="en">
            <term>Types of links</term>
          </desc>
          <category xml:id="parla.voting">
            <catDesc xml:lang="en">
              <term>Voting</term>
            </catDesc>
          </category>
          <category xml:id="parla.print">
            <catDesc xml:lang="en">
              <term>Regular</term>
            </catDesc>
          </category>
        </taxonomy>

wrong date in corpus root setting

Weird event label

https://github.com/ninpnin/ParlaMint/blob/6b36bd4952aecde322bbd4c52ad4eed7a73e6ec3/Data/ParlaMint-SE/ParlaMint-SE.xml#L324-L333

            <listEvent>
              <event from="2014-09-29" to="2018-09-24">
                <label>Riksdagen {start} - {end}</label>
              </event>
              <event from="2018-09-24" to="2018-09-11">
                <label>Riksdagen {start} - {end}</label>
              </event>
            </listEvent>

invalid date in parliament organization

@from should be earlier than @to.

Thanks for this bug. It seems that our validation is not paranoid enough. (@matyaskopp, extend validation)

https://github.com/ninpnin/ParlaMint/blob/6b36bd4952aecde322bbd4c52ad4eed7a73e6ec3/Data/ParlaMint-SE/ParlaMint-SE.xml#L328

              <event from="2018-09-24" to="2018-09-11">
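A minimal sketch of the kind of check that could catch this, assuming the dates are ISO 8601 strings of equal precision (so plain string comparison orders them correctly). The helper names `local_name` and `find_inverted_events` are hypothetical and not part of the ParlaMint validation scripts.

```python
import xml.etree.ElementTree as ET


def local_name(element):
    """Strip a '{namespace}' prefix so the check works with or without the TEI namespace."""
    return element.tag.rsplit("}", 1)[-1]


def find_inverted_events(xml_text):
    """Return (from, to) pairs for every <event> whose @from is later than @to."""
    root = ET.fromstring(xml_text)
    inverted = []
    for element in root.iter():
        if local_name(element) != "event":
            continue
        start, end = element.get("from"), element.get("to")
        # Open-ended events (no @to) are skipped; only complete ranges are compared.
        if start and end and start > end:
            inverted.append((start, end))
    return inverted


SAMPLE = """<listEvent>
  <event from="2014-09-29" to="2018-09-24"><label>first</label></event>
  <event from="2018-09-24" to="2018-09-11"><label>second</label></event>
</listEvent>"""

for start, end in find_inverted_events(SAMPLE):
    print(f"event @from={start} is later than @to={end}")
```

Dates of mixed precision (for example 2006-05 next to 2006-05-01) would need proper parsing rather than string comparison.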

missing term in parliament organization

There should be three terms in the parliament organization, since the corpus root declares three term meetings:

        <meeting n="2014-2018" ana="#parla.uni #parla.term">Mandatperioden 2014–2018</meeting>
        <meeting n="2018-2022" ana="#parla.uni #parla.term">Mandatperioden 2014–2018</meeting>
        <meeting n="2022-2026" ana="#parla.uni #parla.term">Mandatperioden 2014–2018</meeting>

missing opposition relation

Do you have opposition in the Swedish parliament?

split forename

If someone has multiple forenames, each should have its own element: https://github.com/ninpnin/ParlaMint/blob/6b36bd4952aecde322bbd4c52ad4eed7a73e6ec3/Data/ParlaMint-SE/ParlaMint-SE.ana.xml#L602

              <forename>Mubarik Mohamed</forename>

should be

              <forename>Mubarik</forename>
              <forename>Mohamed</forename>
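One way this could be done in a post-processing step is sketched below. It assumes namespace-free element names and that whitespace is the only separator between forenames; `split_forenames` is a hypothetical helper, not code from the SE pipeline.

```python
import xml.etree.ElementTree as ET


def split_forenames(pers_name):
    """Replace each multi-word <forename> with one <forename> per word, in place."""
    for child in list(pers_name):
        if child.tag != "forename" or not child.text:
            continue
        parts = child.text.split()
        if len(parts) < 2:
            continue
        index = list(pers_name).index(child)
        pers_name.remove(child)
        for offset, part in enumerate(parts):
            forename = ET.Element("forename")
            forename.text = part
            forename.tail = child.tail  # keep the original indentation
            pers_name.insert(index + offset, forename)


pers_name = ET.fromstring(
    "<persName><forename>Mubarik Mohamed</forename></persName>"
)
split_forenames(pers_name)
print(ET.tostring(pers_name, encoding="unicode"))
```

Hyphenated forenames contain no whitespace, so they are left as a single element.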

component file meeting

The meeting element in the component file should specify the content of the file (e.g. use parla.sitting if it contains a sitting day): https://github.com/ninpnin/ParlaMint/blob/6b36bd4952aecde322bbd4c52ad4eed7a73e6ec3/Data/ParlaMint-SE/ParlaMint-SE_201516--19.xml#L8

        <meeting n="201516" ana="#parla.uni #parla.term">2015/16</meeting>

CZ sample: https://github.com/clarin-eric/ParlaMint/blob/47a6a842d5a6447266f3ce0d95ad83bdac66673e/Data/ParlaMint-CZ/ParlaMint-CZ_2016-04-13-ps2013-044-02-013-114.xml#L13-L16

debates beginning

It is possible that I don't understand it. Sittings in your data start with a weird sequence of unknown speakers and notes. @TomazErjavec can you help me with the feedback here? https://github.com/ninpnin/ParlaMint/blob/6b36bd4952aecde322bbd4c52ad4eed7a73e6ec3/Data/ParlaMint-SE/ParlaMint-SE_202122--29.xml#L193

Some utterances look just like the notes around them...

<!--not speech-->        <note xml:id="i-F7gsURjTZEfW8BSTvgvtvN">2021/22:89 LKAB:s nekade miljötillstånd i Kiruna</note>
        <note xml:id="i-CkVMpNDmX4b4zUAa45Xkde">av Eric Palmqvist (SD)</note>
        <note xml:id="i-RGrNMU8A8Yer5Nyh4RDLh9">till miljö- och klimatminister Per Bolund (MP)</note>
        <u who="#unknown" xml:id="i-5a88c9462f80f70f-9" ana="#regular">
<!--speech-->          <seg xml:id="i-VHamknouNCJnLu76eNSTHt">2021/22:90 LKAB:s roll som föredöme för svensk gruvnäring</seg>
        </u>
        <note xml:id="i-D7ictpqDrjV2juEZUg2tLP">av Eric Palmqvist (SD)</note>
        <note xml:id="i-VoTcW3RB81Xk4kmudScyyE">till näringsminister Ibrahim Baylan (S)</note>
        <note xml:id="i-P92ws7rgavuwX1dZBmGbW2">2021/22:91 Sanktionsavgiften vid otillåten cabotagetrafik</note>

and even the linguistic annotation is weird for these situations:

        <u ana="#regular" who="#unknown" xml:id="i-5a88c9462f80f70f-9">
          <seg xml:id="i-VHamknouNCJnLu76eNSTHt">
            <s xml:id="i-LgoXkLeyomJbuwwr872Q9b">
              <w lemma="2021" msd="UPosTag=X" xml:id="i-LefRqHrY2HECaVJXpABTUH">2021</w>
              <w lemma="/" msd="UPosTag=X" xml:id="i-LefSUmn5in5PaGgMF1Z4Mb">/</w>
              <w lemma="22:90" msd="UPosTag=X" xml:id="i-LefSi6jD8CWcWKvYx4vv8u">22:90</w>
              <w lemma="LKAB:s" msd="UPosTag=X" xml:id="i-LefSsG8cLgBhmjuSVvNpVB">LKAB:s</w>
              <w lemma="roll" msd="UPosTag=X" xml:id="i-LefT3FqxPk1cyHLbHDQiKf">roll</w>
              <w lemma="som" msd="UPosTag=X" xml:id="i-LefTE5sFHPzN6xE1Hx2beM">som</w>
              <w lemma="föredöme" msd="UPosTag=X" xml:id="i-LefTQ5abLTpHJVfA5F4VUq">föredöme</w>
              <w lemma="för" msd="UPosTag=X" xml:id="i-LefTYQg3iMLYdnBnPevQLu">för</w>
              <w lemma="svensk" msd="UPosTag=X" xml:id="i-LefTha5Svq1duCAfwWNJhB">svensk</w>
              <w lemma="gruvnäring" msd="UPosTag=X" xml:id="i-LefTquAuJiXuEUhJFvEDZF">gruvnäring</w>
              <linkGrp targFunc="head argument" type="UD-SYN">
                <link ana="ud-syn:root" target="#i-LgoXkLeyomJbuwwr872Q9b #i-LefRqHrY2HECaVJXpABTUH"/>
                <link ana="ud-syn:dep" target="#i-LefRqHrY2HECaVJXpABTUH #i-LefSUmn5in5PaGgMF1Z4Mb"/>
                <link ana="ud-syn:dep" target="#i-LefRqHrY2HECaVJXpABTUH #i-LefSi6jD8CWcWKvYx4vv8u"/>
                <link ana="ud-syn:dep" target="#i-LefRqHrY2HECaVJXpABTUH #i-LefSsG8cLgBhmjuSVvNpVB"/>
                <link ana="ud-syn:dep" target="#i-LefRqHrY2HECaVJXpABTUH #i-LefT3FqxPk1cyHLbHDQiKf"/>
                <link ana="ud-syn:dep" target="#i-LefRqHrY2HECaVJXpABTUH #i-LefTE5sFHPzN6xE1Hx2beM"/>
                <link ana="ud-syn:dep" target="#i-LefRqHrY2HECaVJXpABTUH #i-LefTQ5abLTpHJVfA5F4VUq"/>
                <link ana="ud-syn:dep" target="#i-LefRqHrY2HECaVJXpABTUH #i-LefTYQg3iMLYdnBnPevQLu"/>
                <link ana="ud-syn:dep" target="#i-LefRqHrY2HECaVJXpABTUH #i-LefTha5Svq1duCAfwWNJhB"/>
                <link ana="ud-syn:dep" target="#i-LefRqHrY2HECaVJXpABTUH #i-LefTquAuJiXuEUhJFvEDZF"/>
              </linkGrp>
            </s>
          </seg>
        </u>

missing chairperson

speeches split by paragraphs

You are starting a new utterance whenever a new paragraph starts. There is no speaker change... https://github.com/ninpnin/ParlaMint/blob/6b36bd4952aecde322bbd4c52ad4eed7a73e6ec3/Data/ParlaMint-SE/ParlaMint-SE_201516--19.xml#L140-L149

        <note type="speaker" xml:id="i-YATjvzUYvcpbu5mj1LLdGp">Anf. 1 Justitie- och migrationsminister MORGAN JOHANSSON (S):</note>
        <u xml:id="i-94b3b97e4e02f441-6" next="#i-94b3b97e4e02f441-11" who="#Q5887217" ana="#regular">
          <seg xml:id="i-ABY4HuV5fZywtFRhwF4mcW">Fru talman! Mats Green har ställt ett antal frågor, varav de flesta avser mottagandet av ensamkommande barn.</seg>
        </u>
        <u xml:id="i-94b3b97e4e02f441-7" prev="#i-94b3b97e4e02f441-6" who="#Q5887217" ana="#regular">
          <seg xml:id="i-AWTnuk5hjL6euJkcekandQ">Sverige tar ett stort ansvar för människor på flykt. Många av dem som söker sig till Sverige är ensamkommande barn. Till och med mitten av oktober hade det kommit över 17 000 ensamkommande barn. Under de senaste tre månaderna har det kommit mellan 700 och drygt 2 000 barn per vecka. Det är en extraordinär situation.</seg>
        </u>
        <u xml:id="i-94b3b97e4e02f441-8" prev="#i-94b3b97e4e02f441-6" who="#Q5887217" ana="#regular">
          <seg xml:id="i-D29SdDboPa9fkjXZR6GX51">Jag vill börja med att redogöra för hur regeringen hanterar och underlättar de utmaningar som ansvarstagande kommuner ställs inför. I budgetpropositionen för 2016 redovisar regeringen satsningar på sammanlagt ca 2 miljarder kronor under 2016 för bättre mottagande och snabbare etablering. Bland annat höjs schablonersättningen till kommuner för mottagande av nyanlända med ca 50 procent. Denna ersättning utgår även för ensamkommande barn. Vidare höjs schablonersättningen för asylsökande barns skolgång, också det med 50 procent.</seg>
        </u>

I don't understand the usage of @next (referring to the following speech, not the next u) and @prev (referring to the first u of the sequence of u elements that make up one speech).
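A rough sketch of how paragraph-level utterances could be folded back into one u per speech, with each paragraph kept as its own seg. It assumes namespace-free element names and that a speaker note always separates two speeches; `merge_paragraph_utterances` is a hypothetical helper. The @next and @prev attributes are left untouched here and would still need a separate review.

```python
import xml.etree.ElementTree as ET


def merge_paragraph_utterances(parent):
    """Merge runs of directly adjacent <u> elements with the same @who into the first one."""
    previous = None
    for child in list(parent):
        if (
            child.tag == "u"
            and previous is not None
            and child.get("who") == previous.get("who")
        ):
            previous.extend(list(child))  # move the <seg> children over
            parent.remove(child)
            continue
        # Any non-<u> element (e.g. a speaker note) ends the current run.
        previous = child if child.tag == "u" else None


# '#other' is a made-up speaker reference used only for this example.
SAMPLE = """<div>
  <note type="speaker">Anf. 1</note>
  <u who="#Q5887217"><seg>Fru talman!</seg></u>
  <u who="#Q5887217"><seg>Sverige tar ett stort ansvar.</seg></u>
  <u who="#Q5887217"><seg>Jag vill redogöra.</seg></u>
  <note type="speaker">Anf. 2</note>
  <u who="#other"><seg>Fru talman!</seg></u>
</div>"""

div = ET.fromstring(SAMPLE)
merge_paragraph_utterances(div)
print([child.tag for child in div])
```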

ninpnin commented 1 year ago

Weird event label

There's a missing f string in Python. Is

<event from="2014-09-29" to="2018-09-24">
  <label>Riksdagen 2014 - 2018</label>
</event>

correct?
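For illustration, a small sketch of the difference the missing f prefix makes. It assumes the label is built from the years of the two ISO dates; `event_label` is a hypothetical helper, not the actual generation code.

```python
def event_label(start, end):
    """Build the label from the years of two ISO dates (YYYY-MM-DD)."""
    return f"Riksdagen {start[:4]} - {end[:4]}"


# Without the f prefix the braces are kept literally, which is what ended up in the XML.
broken = "Riksdagen {start} - {end}"
fixed = event_label("2014-09-29", "2018-09-24")

print(broken)
print(fixed)
```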

debates beginning

This is how the original data is laid out, there is no clear distinction between debates and the metatext before that.

See: https://www.riksdagen.se/sv/dokument-lagar/dokument/protokoll/protokoll-202122141-mandagen-den-5-september_H909141

missing opposition relation

I tried to find a clear way to denote opposition and confidence and supply, but I didn't find one so I left it out. Is there one?

The rest of the points look like bugs that are rather straightforward and quick to fix.

matyaskopp commented 1 year ago

speaker duplicity?

One or two politicians?

https://github.com/ninpnin/ParlaMint/blob/793c08eaf543b2ca0aab2581d445103b95b4f1d0/Data/ParlaMint-SE/ParlaMint-SE.xml#L608-L625

          <person xml:id="Q59387749">
            <persName>
              <surname>Andersson</surname>
              <forename>Jonas</forename>
            </persName>
            <sex value="M"/>
            <affiliation role="member" ref="#Riksdagen" from="2018-09-24"/>
            <affiliation role="member" ref="#Q504069"/>
          </person>
          <person xml:id="Q58837098">
            <persName>
              <surname>Andersson</surname>
              <forename>Jonas</forename>
            </persName>
            <sex value="M"/>
            <affiliation role="member" ref="#Riksdagen" from="2018-09-24"/>
            <affiliation role="member" ref="#Q504069"/>
          </person>

ninpnin commented 1 year ago

Nah, Jonas Andersson might just be one of the most common names in Sweden.

https://www.riksdagen.se/sv/ledamoter-partier/ledamot/jonas-andersson_c5e9daf0-868f-4c8e-9a69-671c7b22855a

https://www.riksdagen.se/sv/ledamoter-partier/ledamot/jonas-andersson_9944df03-e946-4046-aeef-c49117503a0c

ninpnin commented 1 year ago

Status 2022-11-15

matyaskopp commented 1 year ago

missing opposition relation

I tried to find a clear way to denote opposition and confidence and supply, but I didn't find one so I left it out. Is there one?

Whether a party is in opposition is usually determined by how the party sees itself or how the public sees it. In CZ, it is common for parties to declare that they are in opposition: they don't agree with the government and don't want to take responsibility for the government's actions. There is no contract saying someone is in opposition, so it is a bit fuzzy. To conclude: it is up to you how you see it...

TomazErjavec commented 1 year ago

I agree with @matyaskopp, except for countries with a majority government like Slovenia, where you are either in the coalition, and thus form the government, or you are in opposition. The only exception here are the independent MPs.

ninpnin commented 1 year ago

Okay. So we should

TomazErjavec commented 1 year ago

Not sure I quite understand, or, rather, who is then marked with "coalition"? Also, you don't really "mark opposition parties with the 'opposition' tag", rather, you introduce a relation grouping them as opposition, cf. 5.2.4. Relations between organisations.

matyaskopp commented 1 year ago

debates beginning

This is how the original data is laid out, there is no clear distinction between debates and the metatext before that.

See: https://www.riksdagen.se/sv/dokument-lagar/dokument/protokoll/protokoll-202122141-mandagen-den-5-september_H909141

I do not understand Swedish, but with Google Translate, it seems to me that announcements start with:

Talmannen (meddelade|anmälde)

But it looks more like a statement that someone said something, so I guess it can be encoded as a note.

And the regular/chair speeches are highlighted and numbered:

Anf.  {number}  {name or TALMANNEN} ({party for regular speaker}):

and it is in h2 element in this format: https://www.riksdagen.se/sv/dokument-lagar/dokument/protokoll/protokoll-202122141-mandagen-den-5-september_H909141/html#_Toc115951428

Sometimes there are interpellations (https://www.riksdagen.se/sv/dokument-lagar/dokument/protokoll/protokoll-202122140-mandagen-den-5-september_H909140/html#_Toc115950042) that are, I guess, in written form and are stored in another place.

The question is how to determine the end of a speech. Sometimes there is applause, followed by a note that is not highlighted:

Vi från Miljöpartiet är tydliga: Vi måste stötta svenska företag och hushåll och civilsamhället genom detta. Vi måste stötta ekonomiskt, speciellt de allra svagaste. Vi måste få en ändring av prissättningen på elen. Vi mås­te bygga ut det förnybara, och i allt detta måste vi också energieffektivisera. Det är så vi bygger Sverige starkare tillsammans både på kort och på lång sikt.

(Applåder)

Överläggningen var härmed avslutad.

(Beslut fattades under § 8.)

I am not sure if I am missing something, but it seems that your protocol contains more notes than speeches. So notes should be encoded as notes and not turned into speeches.

matyaskopp commented 1 year ago

incidents

incidents encoding documentation: https://clarin-eric.github.io/ParlaMint/#sec-incidents

<kinesic type="applause">
 <desc>(Applåder)</desc>
</kinesic>

ninpnin commented 1 year ago

@matyaskopp The paragraphs are annotated into utterances, segments etc. automatically using BERT, which is why there are some occasional misclassified "utterances" mixed in with the metatext at the beginning of protocols. We can use some heuristics to improve the quality of that classification if that is necessary (the protocols 2015-2022 seem to be more consistent than the whole 1920-2022 period we're working with).

However, you also imply the protocols should not start with a bunch of notes. Our plan has been not to exclude any data from the original protocols, but rather to annotate so that eg. only utterances can be extracted afterwards in any downstream task. Can we go on doing this?

ninpnin commented 1 year ago

@TomazErjavec I mean, is it necessary to label the supporting parties (in the way the schema proposes, technicalities are not relevant to the question) at all, if we do label the governments and the opposition blocks? AFAIK, the definition of supporting parties can be a bit blurry in the Swedish parliament.

matyaskopp commented 1 year ago

@matyaskopp The paragraphs are annotated into utterances, segments etc. automatically using BERT, which is why there are some occasional misclassified "utterances" mixed in with the metatext at the beginning of protocols. We can use some heuristics to improve the quality of that classification if that is necessary (the protocols 2015-2022 seem to be more consistent than the whole 1920-2022 period we're working with).

However, you also imply the protocols should not start with a bunch of notes. Our plan has been not to exclude any data from the original protocols, but rather to annotate so that eg. only utterances can be extracted afterwards in any downstream task. Can we go on doing this?

I did not know that the utterance annotation is done automatically with BERT, and I do not want to remove any data - beginning notes are ok to preserve, but there are flagrant mixtures of notes and misclassified utterances.

If the protocols from the ParlaMint period are more consistent, they can probably be segmented into utterances by rules: An utterance starts with something like this:

<h2>Anf.  1  TALMANNEN:</h2>
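As a sketch of what such a rule might look like, the regular expression below parses the heading pattern described earlier in this thread (Anf., a number, a name or TALMANNEN, and an optional party in parentheses). It assumes the heading text has already been extracted from the h2 element; `SPEAKER_HEADING` and `parse_speaker_heading` are hypothetical names, and real protocols may contain variants this pattern misses.

```python
import re

# "Anf." + number + speaker + optional "(party)" + optional trailing colon
SPEAKER_HEADING = re.compile(
    r"^Anf\.\s+(?P<number>\d+)\s+(?P<speaker>.+?)"
    r"(?:\s+\((?P<party>[^()]+)\))?:?\s*$"
)


def parse_speaker_heading(text):
    """Return a dict with number, speaker and party, or None if text is not a heading."""
    match = SPEAKER_HEADING.match(text.strip())
    if match is None:
        return None
    return match.groupdict()


for heading in (
    "Anf.  1  TALMANNEN:",
    "Anf. 1 Justitie- och migrationsminister MORGAN JOHANSSON (S):",
    "(Applåder)",
):
    print(parse_speaker_heading(heading))
```

A paragraph for which the function returns a dict would open a new utterance; everything up to the next heading or note would then belong to that speaker.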

ninpnin commented 1 year ago

@matyaskopp I think we are missing the forest for the trees here. Are you available for a quick Zoom call tomorrow or on Friday?

matyaskopp commented 1 year ago

@matyaskopp I think we are missing the forest for the trees here. Are you available for a quick Zoom call tomorrow or on Friday?

@ninpnin, ok, Friday is better (anytime before noon). Tomorrow we have a state holiday and the children are at home... Please send me an email with a link and a time that suits you.

TomazErjavec commented 1 year ago

Could you then also pls. discuss https://github.com/clarin-eric/ParlaMint/issues/436#issuecomment-1316504335, I can't answer that simply.

ninpnin commented 1 year ago

Status 2022-11-18

ninpnin commented 1 year ago

@matyaskopp I've made my changes, I think you can check the files again now.

matyaskopp commented 1 year ago

wrong date in corpus root setting

        <setting>
          <name type="org">Sveriges riksdag</name>
          <name type="address">Riksgatan 1</name>
          <name type="city">Stockholm</name>
          <name type="country">Sweden</name>
          <date when="2016-09-15" ana="#parla.sitting">2016-09-15</date>
        </setting>

matyaskopp commented 1 year ago

missing Swedish translations in taxonomies

  • [ ] taxonomies translations

matyaskopp commented 1 year ago

Any data from the current term?

according to https://github.com/ninpnin/ParlaMint/blob/18c7a5124a0a5c925a387031213888db583617a6/Data/ParlaMint-SE/ParlaMint-SE.xml#L316

          <org xml:id="Riksdagen" role="parliament" ana="#parla.uni #parla.national">
            <orgName full="yes" xml:lang="sv">Sveriges riksdag</orgName>
            <orgName full="abb" xml:lang="sv">Riksdagen</orgName>
            <listEvent>
              <event from="2014-09-29" to="2018-09-24">
                <label>Riksdagen 2014 - 2018</label>
              </event>
              <event from="2018-09-24" to="2022-09-27">
                <label>Riksdagen 2018 - 2022</label>
              </event>
              <event from="2022-09-27">
                <label>Riksdagen 2022 - 2026</label>
              </event>
            </listEvent>
          </org>

Your current term started on 2022-09-27. If you don't have any text content from this term, the meeting should be removed: https://github.com/ninpnin/ParlaMint/blob/18c7a5124a0a5c925a387031213888db583617a6/Data/ParlaMint-SE/ParlaMint-SE.xml#L10

        <meeting n="2022-2026" ana="#parla.uni #parla.term">Mandatperioden 2022–2026</meeting>

BTW, are you sure that there will not be an early election in Sweden? You are setting a date in the future in the text.

matyaskopp commented 1 year ago

missing @join="right"

https://clarin-eric.github.io/ParlaMint/#sec-ana-words

            <s xml:id="i-PDtgGeMQC837eq5Uk8pet4">
              <w msd="UPosTag=NOUN|Case=Nom|Definite=Ind|Gender=Com|Number=Sing" lemma="fru" xml:id="i-PDN9z16TfCMx8fbyzdAR3J">Fru</w>
<!-- next token should contain attribute join: -->
              <w msd="UPosTag=NOUN|Case=Nom|Definite=Ind|Gender=Com|Number=Sing" lemma="talman" xml:id="i-PDNASeziU3EPzn6PQjv8bv">talman</w> 
              <pc msd="UPosTag=PUNCT" xml:id="i-PDNAaz6AqvkfL4d1j9n3Tz">!</pc>
              <linkGrp targFunc="head argument" type="UD-SYN">
                <link ana="ud-syn:det" target="#i-PDNASeziU3EPzn6PQjv8bv #i-PDN9z16TfCMx8fbyzdAR3J"/>
                <link ana="ud-syn:punct" target="#i-PDNASeziU3EPzn6PQjv8bv #i-PDNAaz6AqvkfL4d1j9n3Tz"/>
                <link ana="ud-syn:root" target="#i-PDtgGeMQC837eq5Uk8pet4 #i-PDNASeziU3EPzn6PQjv8bv"/>
              </linkGrp>
            </s>
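A minimal sketch of how the attribute could be derived, assuming the original sentence text and the tokenizer output are both available and that join="right" marks a token with no whitespace before the next one. `join_flags` is a hypothetical helper.

```python
def join_flags(text, tokens):
    """For each token, return True when no whitespace separates it from the next token."""
    flags = []
    position = 0
    for i, token in enumerate(tokens):
        start = text.index(token, position)  # raises ValueError if the token is missing
        position = start + len(token)
        is_last = i == len(tokens) - 1
        glued = (
            not is_last
            and position < len(text)
            and not text[position].isspace()
        )
        flags.append(glued)
    return flags


tokens = ["Fru", "talman", "!"]
for token, glued in zip(tokens, join_flags("Fru talman!", tokens)):
    attribute = ' join="right"' if glued else ""
    print(f"{token}{attribute}")
```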

matyaskopp commented 1 year ago

element name is missing type

https://clarin-eric.github.io/ParlaMint/#sec-ner

              <name>
                <w msd="UPosTag=PROPN|Case=Nom" lemma="Mats" xml:id="i-PDNAiUsgPE86jDhNp83xqr">Mats</w>
              </name>
              <name>
                <w msd="UPosTag=PROPN|Case=Nom" lemma="Green" xml:id="i-PDNApK3JFMBtG7sDSD9uFJ">Green</w>
              </name>

matyaskopp commented 1 year ago

unused prefix

https://github.com/ninpnin/ParlaMint/blob/18c7a5124a0a5c925a387031213888db583617a6/Data/ParlaMint-SE/ParlaMint-SE.ana.xml#L474

        <prefixDef ident="ne" matchPattern="(.+)" replacementPattern="#NER.cnec2.0.$1">
          <p>Taxonomy for named entities (cnec2.0)</p>
        </prefixDef>

matyaskopp commented 1 year ago

remove table of contents

I think the ToC should be removed (it is not a debateSection): https://github.com/ninpnin/ParlaMint/blob/18c7a5124a0a5c925a387031213888db583617a6/Data/ParlaMint-SE/ParlaMint-SE_2015-10-23-prot-201516--19.xml#L659

      <div type="debateSection">
        <head xml:id="i-49WGkDYanfrBThGEhk84fS">§ 1 Anmälan om fördröjda svar på interpellationer</head>
        <note xml:id="i-JAysHk636TLJv63cvK4rPw">§ 2 Ärenden för hänvisning till utskott</note>
        <note xml:id="i-4TnEn6p6xvQ8jVnVStMZ4v">§ 3 Svar på interpellation 2015/16:49 om stöd till kommuner vid mottagande av ensamkommande flyktingbarn</note>
        <note xml:id="i-XxebtctPXck2uwgTMJJVEB" type="speaker">Anf. 1 Justitie- och migrationsminister MORGAN JOHANSSON (S)</note>
        <note xml:id="i-HhiKLKZn5mipTeviSARiJw" type="speaker">Anf. 2 MATS GREEN (M)</note>
        <note xml:id="i-DeeVZkTEi4U1kWpVjA2XCo" type="speaker">Anf. 3 Justitie- och migrationsminister MORGAN JOHANSSON (S)</note>
        <note xml:id="i-8Mh4hbZzi3Ak2AA2ohxtSc" type="speaker">Anf. 4 MATS GREEN (M)</note>
        <note xml:id="i-5dnmYPNMGXt65udU3NKhKb" type="speaker">Anf. 5 Justitie- och migrationsminister MORGAN JOHANSSON (S)</note>

matyaskopp commented 1 year ago

debate section

@TomazErjavec, I like the structuring of the document (https://github.com/ninpnin/ParlaMint/blob/18c7a5124a0a5c925a387031213888db583617a6/Data/ParlaMint-SE/ParlaMint-SE_2015-10-23-prot-201516--19.xml). Adding div and head made it well arranged, but there are div[@type="debateSection"] elements which are not really debates. I am inclined to remove type="debateSection" and preserve the structure. Do you agree?

TomazErjavec commented 1 year ago

Hm, I don't like having a new typeless kind of div. I would say that, following the principle of minimal effort, these are stand-alone notes at the start of the body (before the first div). The "proper" way of doing it would be to introduce <front>, as this is obviously front matter, and front matter should not be linguistically annotated. But this means changing the schema, thinking about exactly what front can contain, and maybe changing the corpora of other partners. Do we want to do all this now? Third option: remove the ToC.

ninpnin commented 1 year ago

@matyaskopp @TomazErjavec the schema does not enumerate the values type can take, does it? Let's make it div type="tableOfContents" ?

I don't want to remove data. You're never gonna notice if you've accidentally removed debate sections.

TomazErjavec commented 1 year ago

I don't want to remove data.

OK. Let's continue this in #472.

ninpnin commented 1 year ago

Status 2022-11-23

ninpnin commented 1 year ago

@matyaskopp All the problems you reported (except for the missing translations) should be fixed now. I also decided to just bite the bullet and remove the TOCs.

I assume the debate section thing can be changed once you decide what to do with it, it should be easy enough from our side.

From my side, it would be good to know if the corpus is now at an acceptable standard. If not, I'd like to have all remaining critical problems listed here at once. I have limited time resources to go back and forth with this.

matyaskopp commented 1 year ago

Thanks for the changes in your corpus. Your corpus is now significantly better. I hope this is the final list of problems that my tired eyes can spot in the sample. More may appear when @TomazErjavec loads it into noSketch, because I am checking just the sample without seeing the whole corpus.

run factorization

There are still taxonomies that are not used in your corpus, I guess. Can you please run factorization, which extracts all taxonomies into separate files:

# factorize taxonomies:
make factorize-teiHeader-INPLACE-SE
# add new files into repository (taxonomies and list of persons and organizations)
git add Data/ParlaMint-SE/ParlaMint-SE-taxonomy-*.xml
git add Data/ParlaMint-SE/ParlaMint-taxonomy-*.xml
git add Data/ParlaMint-SE/ParlaMint-SE-list*.xml

section without utterances

#472

Improve incident annotations

https://github.com/ninpnin/ParlaMint/blob/data/Data/ParlaMint-SE/ParlaMint-SE_2015-11-18-prot-201516--29.xml#L1015

<note xml:id="i-Sd8foAAkXywxAbqQKr4Ykt">( Applåder )</note>

I was not able to find a different type of incident, so I hope there is none.
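A possible sketch of the conversion, following the kinesic example from the incidents documentation quoted earlier in this thread. It assumes applause is the only incident type and that it always appears as a parenthesised "Applåder" note; `note_to_kinesic` and `APPLAUSE` are hypothetical names.

```python
import re
import xml.etree.ElementTree as ET

# Matches "(Applåder)" as well as the spaced variant "( Applåder )".
APPLAUSE = re.compile(r"^\(\s*Applåder\s*\)$")


def note_to_kinesic(note):
    """Return a <kinesic type="applause"> for an applause note, else the note unchanged."""
    text = (note.text or "").strip()
    if note.tag != "note" or not APPLAUSE.match(text):
        return note
    kinesic = ET.Element("kinesic", {"type": "applause"})
    desc = ET.SubElement(kinesic, "desc")
    desc.text = text
    kinesic.tail = note.tail
    return kinesic


note = ET.fromstring("<note>( Applåder )</note>")
print(ET.tostring(note_to_kinesic(note), encoding="unicode"))
```

The sketch drops the attributes of the original note; whether the xml:id should be carried over is left open here.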

Named entities

I don't know what model you are using because your application description doesn't mention it explicitly (proper name, version). But I believe that your model supports multi-token named entities, so this:

<name type="PER">
  <w msd="UPosTag=PROPN|Case=Nom" lemma="Morgan" xml:id="i-4Vkv8ELR4zHptJa1VWJFWk">Morgan</w>
</name>
<name type="PER">
  <w msd="UPosTag=PROPN|Case=Nom" lemma="Johansson" xml:id="i-4VkvE4W2w7McRCjr7bQBvC">Johansson</w>
</name>

should be

<name type="PER">
  <w msd="UPosTag=PROPN|Case=Nom" lemma="Morgan" xml:id="i-4Vkv8ELR4zHptJa1VWJFWk">Morgan</w>
  <w msd="UPosTag=PROPN|Case=Nom" lemma="Johansson" xml:id="i-4VkvE4W2w7McRCjr7bQBvC">Johansson</w>
</name>

missing subtitle in all files

We are not insisting on subtitle #480

https://clarin-eric.github.io/ParlaMint/#sec-titleStmt

The title statement starts with two titles (one main, the other subordinate), both in English and the local language, with the appropriate language code possibly inherited from a superordinate element. They are distinguished by the value main or sub of their type attribute and the value of their xml:lang attribute. In the example it can be seen that the main title of a corpus component is simply an extension of the corpus root title, as it also gives the name of the particular meeting that the component contains, while the subordinate title is, again, free text. Both titles must be unique in the complete corpus.

ninpnin commented 1 year ago

Status 2022-11-28

TomazErjavec commented 1 year ago

WONTFIX: our NER tool does not detect multi-token entities. We're already using a backup as the primary one does not work.

Well, this is sad. Swedish is hardly a less-resourced language. I just checked with Mr. Google, and there are a lot of NER tools for Swedish, so I can't help wondering why you would use a crippled one... But if you are happy with Swedish having different and less useful NEs from all the rest of the corpora, then on your head be it!

ninpnin commented 1 year ago

why you would use a crippled one

  • hfst-SweNER does not seem to be maintained anymore, and we don't have the time to debug the python2 code that's breaking
  • BERT/huggingface NER is easy to integrate to our python scripts, but while accurate is limited in features
  • Anything else would need more time to integrate into our codebase, and as mentioned, we don't have that

TomazErjavec commented 1 year ago

BERT/huggingface NER is easy to integrate to our python scripts, but while accurate is limited in features

Yes, this one seemed the most promising to me. I don't know what you mean by "limited in features", but I have problems imagining it is worse than having individual words as names.

Of course, there is another way, i.e. to join n successive names into one, at least in case they have the same class.

ninpnin commented 1 year ago

You mean to hack together something post-hoc? I mean that's possible but I can come up with situations where that fails.

TomazErjavec commented 1 year ago

You mean to hack together something post-hoc?

Yes.

I mean that's possible but I can come up with situations where that fails.

Not sure why. If they both have the same class, then just merge the two names; if not, leave them apart (unless there is some nice regularity that you would observe, but this could be overdoing it). I imagine most would be two PER, and PER is also the most useful for further analysis (who mentions whom).

ninpnin commented 1 year ago

Well, then I'll implement that heuristic. Let's hope the edge cases that break it are few and far between.
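A minimal sketch of that heuristic, assuming namespace-free element names and that only name elements that are direct siblings with nothing between them are candidates for merging. `merge_adjacent_names` is a hypothetical helper, not the code that was eventually used.

```python
import xml.etree.ElementTree as ET


def merge_adjacent_names(sentence):
    """Merge directly adjacent <name> elements of the same @type into the first one."""
    previous = None
    for child in list(sentence):
        if (
            child.tag == "name"
            and previous is not None
            and child.get("type") == previous.get("type")
        ):
            previous.extend(list(child))  # move the tokens into the first name
            previous.tail = child.tail
            sentence.remove(child)
            continue
        # Any other element (a word, punctuation) breaks the run of names.
        previous = child if child.tag == "name" else None


SAMPLE = """<s>
  <name type="PER"><w>Morgan</w></name>
  <name type="PER"><w>Johansson</w></name>
  <w>i</w>
  <name type="LOC"><w>Sverige</w></name>
</s>"""

sentence = ET.fromstring(SAMPLE)
merge_adjacent_names(sentence)
print(ET.tostring(sentence, encoding="unicode"))
```

Two different people mentioned back to back without any token between them would be merged incorrectly, which is one of the edge cases anticipated above.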

ninpnin commented 1 year ago

Status 2022-11-29

ninpnin commented 1 year ago

@matyaskopp @TomazErjavec Status 2022-12-02

Here is a link to the files:

https://github.com/ninpnin/ParlaMint/releases/tag/v2.1

TomazErjavec commented 1 year ago

Here is a link to the files

I take it this is supposed to be the full TEI (but not .ana) encoded version? If so:

I corrected the XIncludes locally, so I could try the finalization step, which does have some errors and warnings, cf. https://nl.ijs.si/et/tmp/ParlaMint/Repo/ParlaMint-SE.log and do "grep -i error" and "grep -i warning"

ninpnin commented 1 year ago

@TomazErjavec there are no errors when I download the log file?

TomazErjavec commented 1 year ago

Yup, there are none, very nice. This was stock advice. So, just do $ grep -i warning

ninpnin commented 1 year ago

Unique warnings

WARN ParlaMint-SE_2019-11-07-prot-201920--28: fixing subcorpus to covid for date 2019-11-07
WARN ParlaMint-SE_2019-11-08-prot-201920--29: fixing subcorpus to covid for date 2019-11-08
WARN ParlaMint-SE_2019-11-12-prot-201920--30: fixing subcorpus to covid for date 2019-11-12
WARN ParlaMint-SE_2019-11-13-prot-201920--31: fixing subcorpus to covid for date 2019-11-13
WARN ParlaMint-SE_2019-11-14-prot-201920--32: fixing subcorpus to covid for date 2019-11-14
WARN ParlaMint-SE_2019-11-15-prot-201920--33: fixing subcorpus to covid for date 2019-11-15
WARN ParlaMint-SE_2019-11-19-prot-201920--34: fixing subcorpus to covid for date 2019-11-19
WARN ParlaMint-SE_2019-11-20-prot-201920--35: fixing subcorpus to covid for date 2019-11-20
WARN ParlaMint-SE_2019-11-21-prot-201920--36: fixing subcorpus to covid for date 2019-11-21
WARN ParlaMint-SE_2019-11-22-prot-201920--37: fixing subcorpus to covid for date 2019-11-22
WARN ParlaMint-SE_2019-11-26-prot-201920--38: fixing subcorpus to covid for date 2019-11-26
WARN ParlaMint-SE_2019-11-27-prot-201920--39: fixing subcorpus to covid for date 2019-11-27
WARN ParlaMint-SE_2019-11-28-prot-201920--40: fixing subcorpus to covid for date 2019-11-28
WARN ParlaMint-SE_2019-11-29-prot-201920--41: fixing subcorpus to covid for date 2019-11-29
WARN ParlaMint-SE_2019-12-02-prot-201920--42: fixing subcorpus to covid for date 2019-12-02
WARN ParlaMint-SE_2019-12-03-prot-201920--43: fixing subcorpus to covid for date 2019-12-03
WARN ParlaMint-SE_2019-12-04-prot-201920--44: fixing subcorpus to covid for date 2019-12-04
WARN ParlaMint-SE_2019-12-05-prot-201920--45: fixing subcorpus to covid for date 2019-12-05
WARN ParlaMint-SE_2019-12-06-prot-201920--46: fixing subcorpus to covid for date 2019-12-06
WARN ParlaMint-SE_2019-12-09-prot-201920--47: fixing subcorpus to covid for date 2019-12-09
WARN ParlaMint-SE_2019-12-10-prot-201920--48: fixing subcorpus to covid for date 2019-12-10
WARN ParlaMint-SE_2019-12-11-prot-201920--49: fixing subcorpus to covid for date 2019-12-11
WARN ParlaMint-SE_2019-12-12-prot-201920--50: fixing subcorpus to covid for date 2019-12-12
WARN ParlaMint-SE_2019-12-13-prot-201920--51: fixing subcorpus to covid for date 2019-12-13
WARN ParlaMint-SE_2019-12-16-prot-201920--52: fixing subcorpus to covid for date 2019-12-16
WARN ParlaMint-SE_2019-12-17-prot-201920--53: fixing subcorpus to covid for date 2019-12-17
WARN ParlaMint-SE_2019-12-18-prot-201920--54: fixing subcorpus to covid for date 2019-12-18
WARN ParlaMint-SE_2019-12-19-prot-201920--55: fixing subcorpus to covid for date 2019-12-19
WARN ParlaMint-SE_2019-12-20-prot-201920--56: fixing subcorpus to covid for date 2019-12-20
WARN: /project/corpora/Parla/ParlaMint/V3/Data/ParlaMint-SE.TEI/ParlaMint-SE-listOrg.xml not found
WARN: /project/corpora/Parla/ParlaMint/V3/Data/ParlaMint-SE.TEI/ParlaMint-SE-listPerson.xml not found
WARN: No .ana files for SE samples
WARN: No ana root file, skipping
WARN: party without proper name Q10585380
WARN: party without proper name Q3360009
WARN: party without proper name Q50383811
WARN: party without proper name Q61791721
WARN: short date 2006-05
WARN: short date 2016-10

TomazErjavec commented 1 year ago

AFAIK this is automatically fixed, @TomazErjavec confirm?

Yes.

ninpnin commented 1 year ago

@matyaskopp @TomazErjavec Status 2022-12-06

Here is a link to the files:

https://github.com/ninpnin/ParlaMint/releases/tag/v2.1.2

matyaskopp commented 1 year ago

@ninpnin great, can you update the sample on github, please?

And if possible factorize tei header:

# factorize taxonomies:
make factorize-teiHeader-INPLACE-SE
# add new files into repository (taxonomies and list of persons and organizations)
git add Data/ParlaMint-SE/ParlaMint-SE-taxonomy-*.xml
git add Data/ParlaMint-SE/ParlaMint-taxonomy-*.xml
git add Data/ParlaMint-SE/ParlaMint-SE-list*.xml

ninpnin commented 1 year ago

@matyaskopp the sample is now updated. Where do you want the factorized files?