ParlaMint: Comparable Parliamentary Corpora
SE feedback #436

Closed matyaskopp closed 1 year ago

matyaskopp commented 1 year ago

component filenames

Can you please rename the component files according to the recommendations in section 2.3, File names and directory structure?

wrong meeting text content

        <meeting n="2014-2018" ana="#parla.uni #parla.term">Mandatperioden 2014–2018</meeting>
        <meeting n="2018-2022" ana="#parla.uni #parla.term">Mandatperioden 2014–2018</meeting>
        <meeting n="2022-2026" ana="#parla.uni #parla.term">Mandatperioden 2014–2018</meeting>
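A check along these lines could flag the copy-pasted labels automatically. This is only a sketch: it assumes the years in @n should also appear in the element text, and it ignores the TEI namespace for brevity (the function name is made up for illustration).

```python
import re
import xml.etree.ElementTree as ET

def mismatched_meetings(xml_text):
    """Return the @n values of <meeting> elements whose text lacks the years given in @n."""
    root = ET.fromstring(xml_text)
    bad = []
    for meeting in root.iter("meeting"):
        years = re.findall(r"\d{4}", meeting.get("n", ""))
        text = meeting.text or ""
        # Assumption: every four-digit year in @n should be repeated in the label text.
        if years and not all(year in text for year in years):
            bad.append(meeting.get("n"))
    return bad

sample = """<titleStmt>
  <meeting n="2014-2018" ana="#parla.uni #parla.term">Mandatperioden 2014–2018</meeting>
  <meeting n="2018-2022" ana="#parla.uni #parla.term">Mandatperioden 2014–2018</meeting>
  <meeting n="2022-2026" ana="#parla.uni #parla.term">Mandatperioden 2014–2018</meeting>
</titleStmt>"""

print(mismatched_meetings(sample))
```

On the three elements above this reports the second and third term, whose text still says 2014–2018.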

missing Swedish translations in taxonomies

remove unused taxonomies

I guess you can remove this taxonomy, it was used in CZ corpus and it seems that you don't use it.

        <taxonomy xml:id="parla.links">
          <desc xml:lang="en">
            <term>Types of links</term>
          <category xml:id="">
            <catDesc xml:lang="en">
          <category xml:id="parla.print">
            <catDesc xml:lang="en">

wrong date in corpus root setting

Weird event label

              <event from="2014-09-29" to="2018-09-24">
                <label>Riksdagen {start} - {end}</label>
              <event from="2018-09-24" to="2018-09-11">
                <label>Riksdagen {start} - {end}</label>

invalid date in parliament organization

@from should be earlier than @to.

Thanks for this bug. It seems that our validation is not paranoid enough. (@matyaskopp, extend validation)

              <event from="2018-09-24" to="2018-09-11">
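A possible shape for the extended validation, as a rough sketch only: it assumes both attributes hold full ISO dates (YYYY-MM-DD) and is not the check actually used by the ParlaMint scripts. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET
from datetime import date

def find_reversed_events(xml_text):
    """Return (from, to) pairs of <event> elements whose @from is later than @to."""
    root = ET.fromstring(xml_text)
    bad = []
    for elem in root.iter():
        # Compare local names so the check works with or without the TEI namespace.
        if elem.tag.rsplit("}", 1)[-1] != "event":
            continue
        start, end = elem.get("from"), elem.get("to")
        # Assumption: both values are full ISO dates; shorter dates would need extra handling.
        if start and end and date.fromisoformat(start) > date.fromisoformat(end):
            bad.append((start, end))
    return bad

sample = """<listEvent>
  <event from="2014-09-29" to="2018-09-24"><label>Riksdagen</label></event>
  <event from="2018-09-24" to="2018-09-11"><label>Riksdagen</label></event>
</listEvent>"""

print(find_reversed_events(sample))
```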

missing term in parliament organization

There should be three terms in the parliament organization. I expect this because of:

        <meeting n="2014-2018" ana="#parla.uni #parla.term">Mandatperioden 2014–2018</meeting>
        <meeting n="2018-2022" ana="#parla.uni #parla.term">Mandatperioden 2014–2018</meeting>
        <meeting n="2022-2026" ana="#parla.uni #parla.term">Mandatperioden 2014–2018</meeting>

missing opposition relation

Do you have opposition in the Swedish parliament?

split forename

If someone has multiple names, each name should have its own element.

              <forename>Mubarik Mohamed</forename>

should be

              <forename>Mubarik</forename>
              <forename>Mohamed</forename>
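If it helps, the split could be automated roughly as below. This is a sketch that assumes every whitespace-separated word in a forename is a separate name, which will not hold for every person, and it ignores the TEI namespace. The helper name is made up.

```python
import xml.etree.ElementTree as ET

def split_forenames(pers_name):
    """Replace each multi-word <forename> child with one <forename> per word."""
    for child in list(pers_name):
        if child.tag != "forename":
            continue
        parts = (child.text or "").split()
        if len(parts) < 2:
            continue
        index = list(pers_name).index(child)
        pers_name.remove(child)
        for offset, part in enumerate(parts):
            # Copy the attributes of the original element onto each new one.
            new = ET.Element("forename", dict(child.attrib))
            new.text = part
            new.tail = child.tail
            pers_name.insert(index + offset, new)

pers = ET.fromstring("<persName><forename>Mubarik Mohamed</forename></persName>")
split_forenames(pers)
print(ET.tostring(pers, encoding="unicode"))
```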


component file meeting

The meeting element in the component file should specify the content of the file (e.g. use parla.sitting if it contains a sitting day).

        <meeting n="201516" ana="#parla.uni #parla.term">2015/16</meeting>

CZ sample:

debates beginning

It is possible that I don't understand it. Sittings in your data start with a weird sequence of unknown speakers and notes. @TomazErjavec can you help me with the feedback here?

Some notes look similar to some notes...

<!--not speech-->        <note xml:id="i-F7gsURjTZEfW8BSTvgvtvN">2021/22:89 LKAB:s nekade miljötillstånd i Kiruna</note>
        <note xml:id="i-CkVMpNDmX4b4zUAa45Xkde">av Eric Palmqvist (SD)</note>
        <note xml:id="i-RGrNMU8A8Yer5Nyh4RDLh9">till miljö- och klimatminister Per Bolund (MP)</note>
        <u who="#unknown" xml:id="i-5a88c9462f80f70f-9" ana="#regular">
<!--speech-->          <seg xml:id="i-VHamknouNCJnLu76eNSTHt">2021/22:90 LKAB:s roll som föredöme för svensk gruvnäring</seg>
        <note xml:id="i-D7ictpqDrjV2juEZUg2tLP">av Eric Palmqvist (SD)</note>
        <note xml:id="i-VoTcW3RB81Xk4kmudScyyE">till näringsminister Ibrahim Baylan (S)</note>
        <note xml:id="i-P92ws7rgavuwX1dZBmGbW2">2021/22:91 Sanktionsavgiften vid otillåten cabotagetrafik</note>

And even the linguistic annotation is weird in these situations:

        <u ana="#regular" who="#unknown" xml:id="i-5a88c9462f80f70f-9">
          <seg xml:id="i-VHamknouNCJnLu76eNSTHt">
            <s xml:id="i-LgoXkLeyomJbuwwr872Q9b">
              <w lemma="2021" msd="UPosTag=X" xml:id="i-LefRqHrY2HECaVJXpABTUH">2021</w>
              <w lemma="/" msd="UPosTag=X" xml:id="i-LefSUmn5in5PaGgMF1Z4Mb">/</w>
              <w lemma="22:90" msd="UPosTag=X" xml:id="i-LefSi6jD8CWcWKvYx4vv8u">22:90</w>
              <w lemma="LKAB:s" msd="UPosTag=X" xml:id="i-LefSsG8cLgBhmjuSVvNpVB">LKAB:s</w>
              <w lemma="roll" msd="UPosTag=X" xml:id="i-LefT3FqxPk1cyHLbHDQiKf">roll</w>
              <w lemma="som" msd="UPosTag=X" xml:id="i-LefTE5sFHPzN6xE1Hx2beM">som</w>
              <w lemma="föredöme" msd="UPosTag=X" xml:id="i-LefTQ5abLTpHJVfA5F4VUq">föredöme</w>
              <w lemma="för" msd="UPosTag=X" xml:id="i-LefTYQg3iMLYdnBnPevQLu">för</w>
              <w lemma="svensk" msd="UPosTag=X" xml:id="i-LefTha5Svq1duCAfwWNJhB">svensk</w>
              <w lemma="gruvnäring" msd="UPosTag=X" xml:id="i-LefTquAuJiXuEUhJFvEDZF">gruvnäring</w>
              <linkGrp targFunc="head argument" type="UD-SYN">
                <link ana="ud-syn:root" target="#i-LgoXkLeyomJbuwwr872Q9b #i-LefRqHrY2HECaVJXpABTUH"/>
                <link ana="ud-syn:dep" target="#i-LefRqHrY2HECaVJXpABTUH #i-LefSUmn5in5PaGgMF1Z4Mb"/>
                <link ana="ud-syn:dep" target="#i-LefRqHrY2HECaVJXpABTUH #i-LefSi6jD8CWcWKvYx4vv8u"/>
                <link ana="ud-syn:dep" target="#i-LefRqHrY2HECaVJXpABTUH #i-LefSsG8cLgBhmjuSVvNpVB"/>
                <link ana="ud-syn:dep" target="#i-LefRqHrY2HECaVJXpABTUH #i-LefT3FqxPk1cyHLbHDQiKf"/>
                <link ana="ud-syn:dep" target="#i-LefRqHrY2HECaVJXpABTUH #i-LefTE5sFHPzN6xE1Hx2beM"/>
                <link ana="ud-syn:dep" target="#i-LefRqHrY2HECaVJXpABTUH #i-LefTQ5abLTpHJVfA5F4VUq"/>
                <link ana="ud-syn:dep" target="#i-LefRqHrY2HECaVJXpABTUH #i-LefTYQg3iMLYdnBnPevQLu"/>
                <link ana="ud-syn:dep" target="#i-LefRqHrY2HECaVJXpABTUH #i-LefTha5Svq1duCAfwWNJhB"/>
                <link ana="ud-syn:dep" target="#i-LefRqHrY2HECaVJXpABTUH #i-LefTquAuJiXuEUhJFvEDZF"/>

missing chairperson

speeches split by paragraphs

You are starting a new utterance whenever a new paragraph starts. There is no speaker change...

        <note type="speaker" xml:id="i-YATjvzUYvcpbu5mj1LLdGp">Anf. 1 Justitie- och migrationsminister MORGAN JOHANSSON (S):</note>
        <u xml:id="i-94b3b97e4e02f441-6" next="#i-94b3b97e4e02f441-11" who="#Q5887217" ana="#regular">
          <seg xml:id="i-ABY4HuV5fZywtFRhwF4mcW">Fru talman! Mats Green har ställt ett antal frågor, varav de flesta avser mottagandet av ensamkommande barn.</seg>
        <u xml:id="i-94b3b97e4e02f441-7" prev="#i-94b3b97e4e02f441-6" who="#Q5887217" ana="#regular">
          <seg xml:id="i-AWTnuk5hjL6euJkcekandQ">Sverige tar ett stort ansvar för människor på flykt. Många av dem som söker sig till Sverige är ensamkommande barn. Till och med mitten av oktober hade det kommit över 17 000 ensamkommande barn. Under de senaste tre månaderna har det kommit mellan 700 och drygt 2 000 barn per vecka. Det är en extraordinär situation.</seg>
        <u xml:id="i-94b3b97e4e02f441-8" prev="#i-94b3b97e4e02f441-6" who="#Q5887217" ana="#regular">
          <seg xml:id="i-D29SdDboPa9fkjXZR6GX51">Jag vill börja med att redogöra för hur regeringen hanterar och underlättar de utmaningar som ansvarstagande kommuner ställs inför. I budgetpropositionen för 2016 redovisar regeringen satsningar på sammanlagt ca 2 miljarder kronor under 2016 för bättre mottagande och snabbare etablering. Bland annat höjs schablonersättningen till kommuner för mottagande av nyanlända med ca 50 procent. Denna ersättning utgår även för ensamkommande barn. Vidare höjs schablonersättningen för asylsökande barns skolgång, också det med 50 procent.</seg>

I don't understand the usage of @next (referring to the following speech, not the following u) and @prev (referring to the first u element of a sequence of u elements that make up one speech).
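One way to get one utterance per speech would be to fold the continuation u elements into the first one, keeping each paragraph as its own seg. The sketch below assumes that a u with @prev and the same @who as the preceding u continues the same speech, and that @next and @prev can then be dropped. The shortened IDs and texts are illustrative, and the TEI namespace is ignored.

```python
import xml.etree.ElementTree as ET

def merge_continued_utterances(parent):
    """Fold each <u> carrying @prev into the preceding <u> by the same speaker."""
    last_u = None
    for child in list(parent):
        if child.tag != "u":
            last_u = None  # assumption: a note or other element ends the current speech
            continue
        same_speaker = last_u is not None and child.get("who") == last_u.get("who")
        if same_speaker and child.get("prev"):
            last_u.extend(list(child))  # move the <seg> children into the first <u>
            parent.remove(child)
        else:
            last_u = child
    for u in parent.iter("u"):
        # The pointers are no longer needed once the speech is a single <u>.
        u.attrib.pop("next", None)
        u.attrib.pop("prev", None)

sample = """<div>
<note type="speaker">Anf. 1</note>
<u xml:id="u-6" next="#u-11" who="#Q5887217"><seg>Fru talman!</seg></u>
<u xml:id="u-7" prev="#u-6" who="#Q5887217"><seg>Sverige tar ett stort ansvar.</seg></u>
<u xml:id="u-8" prev="#u-6" who="#Q5887217"><seg>Jag vill börja.</seg></u>
</div>"""

div = ET.fromstring(sample)
merge_continued_utterances(div)
print(len(div.findall("u")), len(div.find("u").findall("seg")))
```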

matyaskopp commented 1 year ago

@matyaskopp the sample is now updated. Where do you want the factorized files?

just sample files, it is easier to view the files

ninpnin commented 1 year ago

@matyaskopp there you go

matyaskopp commented 1 year ago

I can't see the sv translations, and it seems that some of your translations are in Czech:

<?xml version="1.0" encoding="UTF-8"?>
<taxonomy xmlns=""
   <desc xml:lang="sv">
   <desc xml:lang="en">
   <category xml:id="parla.agenda">
      <catDesc xml:lang="sv">
         <term>Bod jednání</term>
      <catDesc xml:lang="en">
         <term>Agenda</term>: topic discussed during sitting</catDesc>

I am not sure if you are using this taxonomy, if not then it should be removed

ninpnin commented 1 year ago

@matyaskopp I had removed that manually from the zip files. Now it's gone also in the sample.

sv translations are not included here as we have no one to write them right now

matyaskopp commented 1 year ago

empty lemma and text

@TomazErjavec I think this shouldn't pass the validation, but it does...

              <name type="MISC">
                <w msd="UPosTag=PROPN|Case=Nom" lemma="­" xml:id="i-N1MwFgggp1YMJedEL4fxZA">­</w>
                <w msd="UPosTag=PROPN|Case=Nom" lemma="na" xml:id="i-N1MwQ1n9Bu4cdw9reUXsRE">na</w>

it is caused by additional space in TEI version:

Men de här saker ­ na gör att det blir bättre.

Other samples:

<name type="MISC">
  <w msd="UPosTag=PROPN|Case=Nom" lemma="­" xml:id="i-N1QtESTYvdPtaXhwUGEWkU">­</w>
<w msd="UPosTag=X" lemma="­" xml:id="i-N1QwuffbPNMTVRdAoCGFFS">­</w>

TomazErjavec commented 1 year ago

Indeed it should fail. I have fixed this (and the same for other linguistic attributes) in 0c98b4c, documentation branch. Probably a good idea to merge soon into main. And hopefully all submitted corpora will not fail now!

As for SE, the empty values will need to be fixed now, sorry @ninpnin.

matyaskopp commented 1 year ago

Indeed it should fail. I have fixed this (and the same for other linguistic attributes) in 0c98b4c, documentation branch. Probably a good idea to merge soon into main. And hopefully all submitted corpora will not fail now!

@TomazErjavec, merged without any effect:

ninpnin commented 1 year ago

They are not empty, but soft hyphens instead. Anyway, I removed them. The updated sample is now uploaded and passes the new tests. I'm still re-running the annotation pipeline for the whole corpus.
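For reference, a cleanup of this kind can be sketched as follows. It assumes that a soft hyphen (U+00AD), together with any whitespace around it, marks a line-break hyphenation inside a single word; it is not necessarily how the SE pipeline removes them, and the function names are hypothetical.

```python
import re

SOFT_HYPHEN = "\u00ad"

def remove_soft_hyphens(text):
    """Drop soft hyphens together with any whitespace around them."""
    return re.sub(rf"\s*{SOFT_HYPHEN}\s*", "", text)

def is_effectively_empty(token):
    """True if nothing visible remains once soft hyphens and whitespace are removed."""
    return token.replace(SOFT_HYPHEN, "").strip() == ""

line = f"Men de här saker {SOFT_HYPHEN} na gör att det blir bättre."
print(remove_soft_hyphens(line))
print(is_effectively_empty(SOFT_HYPHEN))
```

Running the first function before tokenization rejoins "saker" and "na" into one word; the second can be used to spot tokens and lemmas that only look empty.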

TomazErjavec commented 1 year ago

Here is a link to the files:

I took these files and processed them before the discussion on "empty" lemmas. It turns out that having RNG validation for this was not crucial after all: such values result in invalid CoNLL-U, so these mistakes could have been caught before too, although somewhat later in the chain (and they are somewhat harder to locate in the XML).

Anyway, the log of the 2.12 SE validation is, as before, at pls. grep as before.

ninpnin commented 1 year ago

@TomazErjavec these errors are pretty obscure to me

[Line 92250 Sent i-P9RyLib6MXXYLHhKA5xCtt]: [L1 Format empty-column] Empty value in column HEAD.
[Line 92250 Sent i-P9RyLib6MXXYLHhKA5xCtt]: [L1 Format empty-column] Empty value in column DEPREL.
[Line 92250 Sent i-P9RyLib6MXXYLHhKA5xCtt]: [L2 Syntax invalid-deprel] Invalid DEPREL value ''.
[Line 92250 Sent i-P9RyLib6MXXYLHhKA5xCtt]: [L2 Syntax unknown-deprel] Unknown DEPREL label: ''

Are they related to the empty 'words' or not?

I.e. should I send you the corpus with the empty lemmas fixed, or look more into these errors?

TomazErjavec commented 1 year ago

Are they related to the empty 'words' or not?

Not sure. I'd expect the error to be different in that case; this one pertains to the parse, not the lemma.

should I send you the corpus with the empty lemmas fixed, or look more into these errors?

A quick look wouldn't hurt.

matyaskopp commented 1 year ago

I don't think it is related to empty words:

[Line 731 Sent i-3icmcZEm9ifBRnS4atoL6x]: [L1 Format empty-column] Empty value in column HEAD.
[Line 731 Sent i-3icmcZEm9ifBRnS4atoL6x]: [L1 Format empty-column] Empty value in column DEPREL.
Format errors: 2
*** FAILED *** with 2 errors

It is produced by the following sentence, where #i-3hfqJbRtFjSBjLhYyKrNbN is not linked in the dependency tree:

<s xml:id="i-3icmcZEm9ifBRnS4atoL6x">
  <w msd="UPosTag=PRON|Case=Nom|Definite=Def|Gender=Com,Neut|Number=Plur" lemma="de" xml:id="i-3hfq9S2V3Fm6TvifRUQUF6">De</w>
<!-- not in tree: -->
  <w msd="UPosTag=VERB|Mood=Ind|Tense=Pres|VerbForm=Fin|Voice=Act" lemma="skapa" xml:id="i-3hfqJbRtFjSBjLhYyKrNbN">skapar</w>
  <w msd="UPosTag=ADV|Degree=Cmp" lemma="snarare" xml:id="i-3hfqSvXLdcxT4dEBHjiHTS">snarare</w>
  <w msd="UPosTag=ADJ|Case=Nom|Definite=Def,Ind|Degree=Pos|Gender=Com,Neut|Number=Plur" lemma="ny" xml:id="i-3hfqcvEgggnNGAfL52kBHv">nya</w>
  <w msd="UPosTag=CCONJ" lemma="och" xml:id="i-3hfqn5e5uATTXaeDctC5eC">och</w>
  <w msd="UPosTag=ADJ|Case=Nom|Definite=Ind|Degree=Pos|Gender=Com,Neut|Number=Plur" lemma="fler" xml:id="i-3hfqx5MRxEHNj85NQBDyUg">fler</w>
  <w msd="UPosTag=NOUN|Case=Nom|Definite=Ind|Gender=Neut|Number=Plur" lemma="problem" xml:id="i-3hfr7EkqAhxTzY4Fx2fspx">problem</w>
  <w msd="UPosTag=ADP" lemma="av" xml:id="i-3hfrFZrHYbUjKpatGSXnh2">av</w>
  <w msd="UPosTag=DET|Definite=Ind|Gender=Com,Neut|Number=Plur,Sing" lemma="samma" xml:id="i-3hfrRZZdbfJeXN233jZgXW">samma</w>
  <w msd="UPosTag=NOUN|Abbr=Yes" lemma="art" xml:id="i-3hfraiy2p8yjnmzvbb1asn">art.</w>
  <w msd="UPosTag=ADV" lemma="därför" xml:id="i-3hfrmYzKhnxUvStLcKdUCU">Därför</w>
  <w msd="UPosTag=VERB|Mood=Ind|Tense=Pres|VerbForm=Fin|Voice=Act" lemma="ha" xml:id="i-3hfrut5n5gUkFjQxvjVP4Y">har</w>
  <w msd="UPosTag=PRON|Case=Nom|Definite=Def|Gender=Com|Number=Plur" lemma="vi" xml:id="i-3hfs5so88kJfTGr7i2XGu2" join="right">vi</w>
  <pc msd="UPosTag=PUNCT" xml:id="i-3hfsFsWUBp8aepHGVKZAjW">,</pc>
  <w msd="UPosTag=NOUN|Case=Nom|Definite=Ind|Gender=Com|Number=Sing" lemma="herr" xml:id="i-3hfsR2usQHofvEGA3B155n">herr</w>
  <w msd="UPosTag=NOUN|Case=Nom|Definite=Ind|Gender=Com|Number=Sing" lemma="talman" xml:id="i-3hfsZN1KnBKwFWnnMarywr" join="right">talman</w>
  <pc msd="UPosTag=PUNCT" xml:id="i-3hfsjMifqF9rT4Dw8stsnL">,</pc>
  <w msd="UPosTag=VERB|VerbForm=Sup|Voice=Act" lemma="välja" xml:id="i-3hfsuMS1tJymebf5vAvmcp">valt</w>
  <w msd="UPosTag=PART" lemma="att" xml:id="i-3hft3gXUGCW2ytBiEangUt">att</w>
  <w msd="UPosTag=VERB|VerbForm=Inf|Voice=Act" lemma="yrka" xml:id="i-3hftCqvsUgB8FJAbnSEaqA">yrka</w>
  <w msd="UPosTag=NOUN|Case=Nom|Definite=Ind|Gender=Neut|Number=Sing" lemma="avslag" xml:id="i-3hftNqeDXk13SqbkZjGUfe">avslag</w>
  <w msd="UPosTag=ADP" lemma="på" xml:id="i-3hftYqMZaopxeP2uM2JNW8">på</w>
  <w msd="UPosTag=NOUN|Case=Nom|Definite=Def|Gender=Com|Number=Sing" lemma="proposition" xml:id="i-3hfthzkxoHW3uo1ntskGrQ" join="right">propositionen</w>
  <pc msd="UPosTag=PUNCT" xml:id="i-3hftrKrRBB2KF5YRDHcBiU">.</pc>
  <linkGrp targFunc="head argument" type="UD-SYN">
    <link ana="ud-syn:nsubj" target="#i-3hfrut5n5gUkFjQxvjVP4Y #i-3hfq9S2V3Fm6TvifRUQUF6"/>
    <link ana="ud-syn:advmod" target="#i-3hfrut5n5gUkFjQxvjVP4Y #i-3hfqSvXLdcxT4dEBHjiHTS"/>
    <link ana="ud-syn:dep" target="#i-3hfsR2usQHofvEGA3B155n #i-3hfqcvEgggnNGAfL52kBHv"/>
    <link ana="ud-syn:amod" target="#i-3hfsjMifqF9rT4Dw8stsnL #i-3hfqn5e5uATTXaeDctC5eC"/>
    <link ana="ud-syn:dep" target="#i-3hfsR2usQHofvEGA3B155n #i-3hfqx5MRxEHNj85NQBDyUg"/>
    <link ana="ud-syn:obj" target="#i-3hfrut5n5gUkFjQxvjVP4Y #i-3hfr7EkqAhxTzY4Fx2fspx"/>
    <link ana="ud-syn:nmod" target="#i-3hfsjMifqF9rT4Dw8stsnL #i-3hfrFZrHYbUjKpatGSXnh2"/>
    <link ana="ud-syn:det" target="#i-3hftCqvsUgB8FJAbnSEaqA #i-3hfrRZZdbfJeXN233jZgXW"/>
    <link ana="ud-syn:advmod" target="#i-3hfsuMS1tJymebf5vAvmcp #i-3hfraiy2p8yjnmzvbb1asn"/>
    <link ana="ud-syn:advmod" target="#i-3hfrut5n5gUkFjQxvjVP4Y #i-3hfrmYzKhnxUvStLcKdUCU"/>
    <link ana="ud-syn:nsubj" target="#i-3hfrut5n5gUkFjQxvjVP4Y #i-3hfs5so88kJfTGr7i2XGu2"/>
    <link ana="ud-syn:punct" target="#i-3hfrut5n5gUkFjQxvjVP4Y #i-3hfsFsWUBp8aepHGVKZAjW"/>
    <link ana="ud-syn:det" target="#i-3hfsZN1KnBKwFWnnMarywr #i-3hfsR2usQHofvEGA3B155n"/>
    <link ana="ud-syn:obj" target="#i-3hfrut5n5gUkFjQxvjVP4Y #i-3hfsZN1KnBKwFWnnMarywr"/>
    <link ana="ud-syn:punct" target="#i-3hfrut5n5gUkFjQxvjVP4Y #i-3hfsjMifqF9rT4Dw8stsnL"/>
    <link ana="ud-syn:dep" target="#i-3hfrut5n5gUkFjQxvjVP4Y #i-3hfsuMS1tJymebf5vAvmcp"/>
    <link ana="ud-syn:obj" target="#i-3hfsuMS1tJymebf5vAvmcp #i-3hft3gXUGCW2ytBiEangUt"/>
    <link ana="ud-syn:punct" target="#i-3hft3gXUGCW2ytBiEangUt #i-3hftCqvsUgB8FJAbnSEaqA"/>
    <link ana="ud-syn:obj" target="#i-3hftCqvsUgB8FJAbnSEaqA #i-3hftNqeDXk13SqbkZjGUfe"/>
    <link ana="ud-syn:advmod" target="#i-3hftCqvsUgB8FJAbnSEaqA #i-3hftYqMZaopxeP2uM2JNW8"/>
    <link ana="ud-syn:advmod" target="#i-3hftYqMZaopxeP2uM2JNW8 #i-3hfthzkxoHW3uo1ntskGrQ"/>
    <link ana="ud-syn:punct" target="#i-3hfrut5n5gUkFjQxvjVP4Y #i-3hftrKrRBB2KF5YRDHcBiU"/>
    <link ana="ud-syn:root" target="#i-3icmcZEm9ifBRnS4atoL6x #i-3hfrut5n5gUkFjQxvjVP4Y"/>
# sent_id = i-3icmcZEm9ifBRnS4atoL6x
# text = De skapar snarare nya och fler problem av samma art. Därför har vi, herr talman, valt att yrka avslag på propositionen.
1       De      de      PRON    _       Case=Nom|Definite=Def|Gender=Com,Neut|Number=Plur       12      nsubj   _       NER=O
2       skapar  skapa   VERB    _       Mood=Ind|Tense=Pres|VerbForm=Fin|Voice=Act                      _       NER=O
3       snarare snarare ADV     _       Degree=Cmp      12      advmod  _       NER=O
4       nya     ny      ADJ     _       Case=Nom|Definite=Def,Ind|Degree=Pos|Gender=Com,Neut|Number=Plur        15      dep     _       NER=O
5       och     och     CCONJ   _       _       17      amod    _       NER=O
6       fler    fler    ADJ     _       Case=Nom|Definite=Ind|Degree=Pos|Gender=Com,Neut|Number=Plur    15      dep     _       NER=O
7       problem problem NOUN    _       Case=Nom|Definite=Ind|Gender=Neut|Number=Plur   12      obj     _       NER=O
8       av      av      ADP     _       _       17      nmod    _       NER=O
9       samma   samma   DET     _       Definite=Ind|Gender=Com,Neut|Number=Plur,Sing   20      det     _       NER=O
10      art.    art     NOUN    _       Abbr=Yes        18      advmod  _       NER=O
11      Därför  därför  ADV     _       _       12      advmod  _       NER=O
12      har     ha      VERB    _       Mood=Ind|Tense=Pres|VerbForm=Fin|Voice=Act      0       root    _       NER=O
13      vi      vi      PRON    _       Case=Nom|Definite=Def|Gender=Com|Number=Plur    12      nsubj   _       NER=O|SpaceAfter=No
14      ,       ,       PUNCT   _       _       12      punct   _       NER=O
15      herr    herr    NOUN    _       Case=Nom|Definite=Ind|Gender=Com|Number=Sing    16      det     _       NER=O
16      talman  talman  NOUN    _       Case=Nom|Definite=Ind|Gender=Com|Number=Sing    12      obj     _       NER=O|SpaceAfter=No
17      ,       ,       PUNCT   _       _       12      punct   _       NER=O
18      valt    välja   VERB    _       VerbForm=Sup|Voice=Act  12      dep     _       NER=O
19      att     att     PART    _       _       18      obj     _       NER=O
20      yrka    yrka    VERB    _       VerbForm=Inf|Voice=Act  19      punct   _       NER=O
21      avslag  avslag  NOUN    _       Case=Nom|Definite=Ind|Gender=Neut|Number=Sing   20      obj     _       NER=O
22      på      på      ADP     _       _       20      advmod  _       NER=O
23      propositionen   proposition     NOUN    _       Case=Nom|Definite=Def|Gender=Com|Number=Sing    22      advmod  _       NER=O|SpaceAfter=No
24      .       .       PUNCT   _       _       12      punct   _       NER=O
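Tokens like this can be found directly in the TEI, before the CoNLL-U conversion. The sketch below assumes the encoding shown above, where each link lists the head first and the argument second in @target; it uses a toy sentence with shortened, made-up IDs and no TEI namespace.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

def unlinked_tokens(sentence):
    """Return IDs of <w>/<pc> tokens that never occur as the argument of a <link>."""
    token_ids = [t.get(XML_ID) for t in sentence.iter() if t.tag in ("w", "pc")]
    linked = set()
    for link in sentence.iter("link"):
        # @target holds "head argument", matching targFunc on the linkGrp.
        head, argument = link.get("target").split()
        linked.add(argument.lstrip("#"))
    return [tid for tid in token_ids if tid not in linked]

# Toy input: the links are not meant to be a linguistically sensible parse.
sample = """<s xml:id="s1">
  <w xml:id="w1">De</w>
  <w xml:id="w2">skapar</w>
  <w xml:id="w3">snarare</w>
  <linkGrp targFunc="head argument" type="UD-SYN">
    <link ana="ud-syn:nsubj" target="#w3 #w1"/>
    <link ana="ud-syn:root" target="#s1 #w3"/>
  </linkGrp>
</s>"""

print(unlinked_tokens(ET.fromstring(sample)))
```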

@ninpnin BTW, you can now see that using semantically meaningless IDs makes debugging more complicated.

ninpnin commented 1 year ago

@matyaskopp Nahh, it's pretty easy to grep those things. Ofc it's a matter of taste. The problem for me is that I haven't fixed the seed, and thus the IDs change every time I regenerate the corpus.

BTW I found the issue: about 20 words in the whole corpus were incorrectly tagged as abbreviations, which screwed up the sentence segmentation. Everything passes locally now, and I expect to finally finish this thing today.
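On the IDs changing with every regeneration: besides fixing the seed, the IDs could be derived from the element's position, so that the same input always gives the same ID. A minimal sketch, assuming a document name and an element path are available for each element; the namespace URL, document name and path format below are all made up for illustration.

```python
import uuid

# Hypothetical namespace; any fixed UUID would do.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.org/parlamint-se")

def stable_id(doc_name, path):
    """Derive a reproducible ID from the document name and the element's path."""
    # The "i-" prefix keeps the result a valid xml:id (it must not start with a digit).
    return "i-" + uuid.uuid5(NAMESPACE, f"{doc_name}/{path}").hex

first = stable_id("example-doc", "u[1]/seg[1]")
again = stable_id("example-doc", "u[1]/seg[1]")
other = stable_id("example-doc", "u[1]/seg[2]")
print(first == again, first == other)
```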

ninpnin commented 1 year ago

@TomazErjavec Here's the updated full corpus

TomazErjavec commented 1 year ago

Thanks, the log at We still have no. of words and covid date (which I fix, so ok), the parties without name (which you say are missing from source) and short dates (which I thought you fixed). Still, just warnings, so ok. Importantly CoNLL-U looks good!

So, unless @matyaskopp protests, I think you are good for 3.0.

matyaskopp commented 1 year ago

So, unless @matyaskopp protests, I think you are good for 3.0.

No protest

@TomazErjavec, I checked the ParlaMint-SE-log, and the end of the file is strange, but I guess it does not cause any trouble:

make jvert-one
make[2]: Entering directory '/home/project/corpora/Parla/ParlaMint/V3'
/project/corpora/Parla/ParlaMint/ParlaMint/Scripts/ -codes HU -in Master -out Verts
INFO: ***Joining HU
find: ‘/home/project/corpora/Parla/ParlaMint/V3/Master/ParlaMint-HU.vert’: No such file or directory
cp: cannot stat '/home/project/corpora/Parla/ParlaMint/V3/Master/ParlaMint-HU.vert/*_hu.regi': No such file or directory
make[2]: Leaving directory '/home/project/corpora/Parla/ParlaMint/V3'
make pack-one
make[2]: Entering directory '/home/project/corpora/Parla/ParlaMint/V3'
/project/corpora/Parla/ParlaMint/ParlaMint/Scripts/ -codes 'HU' -in Master -out Transfer
INFO: ***Packing HU
INFO: *Packing ParlaMint-HU.TEI, ParlaMint-HU.txt
INFO: *Packing ParlaMint-HU.TEI.ana, ParlaMint-HU.conllu, ParlaMint-HU.vert
WARN: No ana root file, skipping
rsync -av Transfer/ParlaMint-HU.*
sending incremental file list

sent 49,941 bytes  received 87,238 bytes  39,194.00 bytes/sec
total size is 155,053,949  speedup is 1,130.30
make[2]: Leaving directory '/home/project/corpora/Parla/ParlaMint/V3'
rsync -av ParlaMint-SE.log
sending incremental file list

TomazErjavec commented 1 year ago

No protest

Great, @ninpnin, feel free to close.

@TomazErjavec, I checked the ParlaMint-SE-log, and the end of the file is strange, but I guess it does not cause any trouble:

Yes, I know. I started running HU before SE was finished but, indeed, no harm done.

ninpnin commented 1 year ago

@TomazErjavec it seems I can't close issues here

