Closed matyaskopp closed 1 year ago
@matyaskopp the sample is now updated. Where do you want the factorized files?
just sample files, it is easier to view the files
@matyaskopp there you go https://github.com/ninpnin/ParlaMint/tree/data/Data/ParlaMint-SE
I can't see the sv translations, and it seems that some of your translations are in Czech:
https://github.com/clarin-eric/ParlaMint/blob/e2e2406742deacdc653b10db98820f5b40cab0cf/Data/ParlaMint-SE/ParlaMint-SE-taxonomy-meeting.parts.xml
<?xml version="1.0" encoding="UTF-8"?>
<taxonomy xmlns="http://www.tei-c.org/ns/1.0"
xml:id="ParlaMint-SE-taxonomy-meeting.parts"
xml:lang="sv">
<desc xml:lang="sv">
<term>Bod</term>
</desc>
<desc xml:lang="en">
<term>Agenda</term>
</desc>
<category xml:id="parla.agenda">
<catDesc xml:lang="sv">
<term>Bod jednání</term>
</catDesc>
<catDesc xml:lang="en">
<term>Agenda</term>: topic discussed during sitting</catDesc>
</category>
</taxonomy>
I am not sure if you are using this taxonomy; if not, it should be removed.
@matyaskopp I had removed that manually from the zip files. Now it's gone also in the sample.
sv translations are not included here as we have no one to write them right now
@TomazErjavec I think this shouldn't pass the validation, but it does...
<name type="MISC">
<w msd="UPosTag=PROPN|Case=Nom" lemma="" xml:id="i-N1MwFgggp1YMJedEL4fxZA"></w>
<w msd="UPosTag=PROPN|Case=Nom" lemma="na" xml:id="i-N1MwQ1n9Bu4cdw9reUXsRE">na</w>
</name>
It is caused by an additional space in the TEI version:
Men de här saker na gör att det blir bättre.
More samples:
<name type="MISC">
<w msd="UPosTag=PROPN|Case=Nom" lemma="" xml:id="i-N1QtESTYvdPtaXhwUGEWkU"></w>
</name>
<w msd="UPosTag=X" lemma="" xml:id="i-N1QwuffbPNMTVRdAoCGFFS"></w>
Indeed it should fail. I have fixed this (and for other linguistic attributes) in 0c98b4c, documentation branch. Probably a good idea to merge soon into main. And hopefully all submitted corpora will not fail now!
As for SE, the empty values will need to be fixed now, sorry @ninpnin.
Indeed it should fail. I have fixed this (and for other linguistic attributes) in 0c98b4c, documentation branch. Probably a good idea to merge soon into main. And hopefully all submitted corpora will not fail now!
@TomazErjavec, merged without any effect: https://github.com/ninpnin/ParlaMint/commit/0c98b4c72259b04aba8dc80a7047f36c39a4972c
https://github.com/clarin-eric/ParlaMint/actions/runs/3631965050
They are not empty, but soft hyphens instead. Anyway, I removed them. The updated sample is now uploaded and passes the new tests. I'm still re-running the annotation pipeline for the whole corpus.
https://github.com/ninpnin/ParlaMint/commit/27a7991321589c4244aff43b3c78047986b27832
Here is a link to the files: https://github.com/ninpnin/ParlaMint/releases/tag/v2.1.2
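A minimal Python sketch of how such tokens could be spotted before running the full validation. It assumes the soft hyphen is U+00AD and treats a token as blank when its text or lemma is empty once soft hyphens and whitespace are removed. The function name `find_blank_tokens` and the ids `w1`/`w2` are made up for illustration; this is not one of the ParlaMint scripts.

```python
import xml.etree.ElementTree as ET

TEI_NS = "{http://www.tei-c.org/ns/1.0}"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
SOFT_HYPHEN = "\u00ad"  # assumption: the invisible character is U+00AD


def find_blank_tokens(xml_text):
    """Return xml:id values of <w> elements whose text or lemma is blank.

    'Blank' means empty, whitespace-only, or consisting only of soft hyphens.
    """
    root = ET.fromstring(xml_text)
    blank_ids = []
    for w in root.iter(TEI_NS + "w"):
        text = (w.text or "").replace(SOFT_HYPHEN, "").strip()
        lemma = w.get("lemma", "").replace(SOFT_HYPHEN, "").strip()
        if not text or not lemma:
            blank_ids.append(w.get(XML_ID))
    return blank_ids


# Hypothetical sample modelled on the <name> snippet quoted in this thread.
sample = (
    '<name xmlns="http://www.tei-c.org/ns/1.0" type="MISC">'
    '<w msd="UPosTag=PROPN|Case=Nom" lemma="\u00ad" xml:id="w1">\u00ad</w>'
    '<w msd="UPosTag=PROPN|Case=Nom" lemma="na" xml:id="w2">na</w>'
    "</name>"
)
print(find_blank_tokens(sample))
```

Stripping the soft hyphen before the emptiness test matters here, because a plain check for an empty string would miss tokens that only look empty.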
I took these files and processed them before the discussion on "empty" lemmas. It turns out that having RNG validation on this was not crucial after all, as they result in invalid CoNLL-U, so such mistakes could have been caught before too, although somewhat later in the chain (and they are somewhat harder to identify in the XML).
Anyway, the log of the 2.1.2 SE validation is, as before, at https://nl.ijs.si/et/tmp/ParlaMint/Repo/ParlaMint-SE.log. Please grep as before.
@TomazErjavec these errors are pretty obscure to me
[Line 92250 Sent i-P9RyLib6MXXYLHhKA5xCtt]: [L1 Format empty-column] Empty value in column HEAD.
[Line 92250 Sent i-P9RyLib6MXXYLHhKA5xCtt]: [L1 Format empty-column] Empty value in column DEPREL.
[Line 92250 Sent i-P9RyLib6MXXYLHhKA5xCtt]: [L2 Syntax invalid-deprel] Invalid DEPREL value ''.
[Line 92250 Sent i-P9RyLib6MXXYLHhKA5xCtt]: [L2 Syntax unknown-deprel] Unknown DEPREL label: ''
Are they related to the empty 'words' or not?
I.e. should I send you the corpus with the empty lemmas fixed, or look more into these errors?
Are they related to the empty 'words' or not?
Not sure. I'd expect the error to be different in this case; this one pertains to the parse, not the lemma.
should I send you the corpus with the empty lemmas fixed, or look more into these errors?
A quick look wouldn't hurt.
I don't think it is related to empty words:
[Line 731 Sent i-3icmcZEm9ifBRnS4atoL6x]: [L1 Format empty-column] Empty value in column HEAD.
[Line 731 Sent i-3icmcZEm9ifBRnS4atoL6x]: [L1 Format empty-column] Empty value in column DEPREL.
Format errors: 2
*** FAILED *** with 2 errors
It is produced by this sentence, in which #i-3hfqJbRtFjSBjLhYyKrNbN is not linked in the dependency tree:
<s xml:id="i-3icmcZEm9ifBRnS4atoL6x">
<w msd="UPosTag=PRON|Case=Nom|Definite=Def|Gender=Com,Neut|Number=Plur" lemma="de" xml:id="i-3hfq9S2V3Fm6TvifRUQUF6">De</w>
<!-- not in tree: -->
<w msd="UPosTag=VERB|Mood=Ind|Tense=Pres|VerbForm=Fin|Voice=Act" lemma="skapa" xml:id="i-3hfqJbRtFjSBjLhYyKrNbN">skapar</w>
<w msd="UPosTag=ADV|Degree=Cmp" lemma="snarare" xml:id="i-3hfqSvXLdcxT4dEBHjiHTS">snarare</w>
<w msd="UPosTag=ADJ|Case=Nom|Definite=Def,Ind|Degree=Pos|Gender=Com,Neut|Number=Plur" lemma="ny" xml:id="i-3hfqcvEgggnNGAfL52kBHv">nya</w>
<w msd="UPosTag=CCONJ" lemma="och" xml:id="i-3hfqn5e5uATTXaeDctC5eC">och</w>
<w msd="UPosTag=ADJ|Case=Nom|Definite=Ind|Degree=Pos|Gender=Com,Neut|Number=Plur" lemma="fler" xml:id="i-3hfqx5MRxEHNj85NQBDyUg">fler</w>
<w msd="UPosTag=NOUN|Case=Nom|Definite=Ind|Gender=Neut|Number=Plur" lemma="problem" xml:id="i-3hfr7EkqAhxTzY4Fx2fspx">problem</w>
<w msd="UPosTag=ADP" lemma="av" xml:id="i-3hfrFZrHYbUjKpatGSXnh2">av</w>
<w msd="UPosTag=DET|Definite=Ind|Gender=Com,Neut|Number=Plur,Sing" lemma="samma" xml:id="i-3hfrRZZdbfJeXN233jZgXW">samma</w>
<w msd="UPosTag=NOUN|Abbr=Yes" lemma="art" xml:id="i-3hfraiy2p8yjnmzvbb1asn">art.</w>
<w msd="UPosTag=ADV" lemma="därför" xml:id="i-3hfrmYzKhnxUvStLcKdUCU">Därför</w>
<w msd="UPosTag=VERB|Mood=Ind|Tense=Pres|VerbForm=Fin|Voice=Act" lemma="ha" xml:id="i-3hfrut5n5gUkFjQxvjVP4Y">har</w>
<w msd="UPosTag=PRON|Case=Nom|Definite=Def|Gender=Com|Number=Plur" lemma="vi" xml:id="i-3hfs5so88kJfTGr7i2XGu2" join="right">vi</w>
<pc msd="UPosTag=PUNCT" xml:id="i-3hfsFsWUBp8aepHGVKZAjW">,</pc>
<w msd="UPosTag=NOUN|Case=Nom|Definite=Ind|Gender=Com|Number=Sing" lemma="herr" xml:id="i-3hfsR2usQHofvEGA3B155n">herr</w>
<w msd="UPosTag=NOUN|Case=Nom|Definite=Ind|Gender=Com|Number=Sing" lemma="talman" xml:id="i-3hfsZN1KnBKwFWnnMarywr" join="right">talman</w>
<pc msd="UPosTag=PUNCT" xml:id="i-3hfsjMifqF9rT4Dw8stsnL">,</pc>
<w msd="UPosTag=VERB|VerbForm=Sup|Voice=Act" lemma="välja" xml:id="i-3hfsuMS1tJymebf5vAvmcp">valt</w>
<w msd="UPosTag=PART" lemma="att" xml:id="i-3hft3gXUGCW2ytBiEangUt">att</w>
<w msd="UPosTag=VERB|VerbForm=Inf|Voice=Act" lemma="yrka" xml:id="i-3hftCqvsUgB8FJAbnSEaqA">yrka</w>
<w msd="UPosTag=NOUN|Case=Nom|Definite=Ind|Gender=Neut|Number=Sing" lemma="avslag" xml:id="i-3hftNqeDXk13SqbkZjGUfe">avslag</w>
<w msd="UPosTag=ADP" lemma="på" xml:id="i-3hftYqMZaopxeP2uM2JNW8">på</w>
<w msd="UPosTag=NOUN|Case=Nom|Definite=Def|Gender=Com|Number=Sing" lemma="proposition" xml:id="i-3hfthzkxoHW3uo1ntskGrQ" join="right">propositionen</w>
<pc msd="UPosTag=PUNCT" xml:id="i-3hftrKrRBB2KF5YRDHcBiU">.</pc>
<linkGrp targFunc="head argument" type="UD-SYN">
<link ana="ud-syn:nsubj" target="#i-3hfrut5n5gUkFjQxvjVP4Y #i-3hfq9S2V3Fm6TvifRUQUF6"/>
<link ana="ud-syn:advmod" target="#i-3hfrut5n5gUkFjQxvjVP4Y #i-3hfqSvXLdcxT4dEBHjiHTS"/>
<link ana="ud-syn:dep" target="#i-3hfsR2usQHofvEGA3B155n #i-3hfqcvEgggnNGAfL52kBHv"/>
<link ana="ud-syn:amod" target="#i-3hfsjMifqF9rT4Dw8stsnL #i-3hfqn5e5uATTXaeDctC5eC"/>
<link ana="ud-syn:dep" target="#i-3hfsR2usQHofvEGA3B155n #i-3hfqx5MRxEHNj85NQBDyUg"/>
<link ana="ud-syn:obj" target="#i-3hfrut5n5gUkFjQxvjVP4Y #i-3hfr7EkqAhxTzY4Fx2fspx"/>
<link ana="ud-syn:nmod" target="#i-3hfsjMifqF9rT4Dw8stsnL #i-3hfrFZrHYbUjKpatGSXnh2"/>
<link ana="ud-syn:det" target="#i-3hftCqvsUgB8FJAbnSEaqA #i-3hfrRZZdbfJeXN233jZgXW"/>
<link ana="ud-syn:advmod" target="#i-3hfsuMS1tJymebf5vAvmcp #i-3hfraiy2p8yjnmzvbb1asn"/>
<link ana="ud-syn:advmod" target="#i-3hfrut5n5gUkFjQxvjVP4Y #i-3hfrmYzKhnxUvStLcKdUCU"/>
<link ana="ud-syn:nsubj" target="#i-3hfrut5n5gUkFjQxvjVP4Y #i-3hfs5so88kJfTGr7i2XGu2"/>
<link ana="ud-syn:punct" target="#i-3hfrut5n5gUkFjQxvjVP4Y #i-3hfsFsWUBp8aepHGVKZAjW"/>
<link ana="ud-syn:det" target="#i-3hfsZN1KnBKwFWnnMarywr #i-3hfsR2usQHofvEGA3B155n"/>
<link ana="ud-syn:obj" target="#i-3hfrut5n5gUkFjQxvjVP4Y #i-3hfsZN1KnBKwFWnnMarywr"/>
<link ana="ud-syn:punct" target="#i-3hfrut5n5gUkFjQxvjVP4Y #i-3hfsjMifqF9rT4Dw8stsnL"/>
<link ana="ud-syn:dep" target="#i-3hfrut5n5gUkFjQxvjVP4Y #i-3hfsuMS1tJymebf5vAvmcp"/>
<link ana="ud-syn:obj" target="#i-3hfsuMS1tJymebf5vAvmcp #i-3hft3gXUGCW2ytBiEangUt"/>
<link ana="ud-syn:punct" target="#i-3hft3gXUGCW2ytBiEangUt #i-3hftCqvsUgB8FJAbnSEaqA"/>
<link ana="ud-syn:obj" target="#i-3hftCqvsUgB8FJAbnSEaqA #i-3hftNqeDXk13SqbkZjGUfe"/>
<link ana="ud-syn:advmod" target="#i-3hftCqvsUgB8FJAbnSEaqA #i-3hftYqMZaopxeP2uM2JNW8"/>
<link ana="ud-syn:advmod" target="#i-3hftYqMZaopxeP2uM2JNW8 #i-3hfthzkxoHW3uo1ntskGrQ"/>
<link ana="ud-syn:punct" target="#i-3hfrut5n5gUkFjQxvjVP4Y #i-3hftrKrRBB2KF5YRDHcBiU"/>
<link ana="ud-syn:root" target="#i-3icmcZEm9ifBRnS4atoL6x #i-3hfrut5n5gUkFjQxvjVP4Y"/>
</linkGrp>
</s>
# sent_id = i-3icmcZEm9ifBRnS4atoL6x
# text = De skapar snarare nya och fler problem av samma art. Därför har vi, herr talman, valt att yrka avslag på propositionen.
1 De de PRON _ Case=Nom|Definite=Def|Gender=Com,Neut|Number=Plur 12 nsubj _ NER=O
2 skapar skapa VERB _ Mood=Ind|Tense=Pres|VerbForm=Fin|Voice=Act _ NER=O
3 snarare snarare ADV _ Degree=Cmp 12 advmod _ NER=O
4 nya ny ADJ _ Case=Nom|Definite=Def,Ind|Degree=Pos|Gender=Com,Neut|Number=Plur 15 dep _ NER=O
5 och och CCONJ _ _ 17 amod _ NER=O
6 fler fler ADJ _ Case=Nom|Definite=Ind|Degree=Pos|Gender=Com,Neut|Number=Plur 15 dep _ NER=O
7 problem problem NOUN _ Case=Nom|Definite=Ind|Gender=Neut|Number=Plur 12 obj _ NER=O
8 av av ADP _ _ 17 nmod _ NER=O
9 samma samma DET _ Definite=Ind|Gender=Com,Neut|Number=Plur,Sing 20 det _ NER=O
10 art. art NOUN _ Abbr=Yes 18 advmod _ NER=O
11 Därför därför ADV _ _ 12 advmod _ NER=O
12 har ha VERB _ Mood=Ind|Tense=Pres|VerbForm=Fin|Voice=Act 0 root _ NER=O
13 vi vi PRON _ Case=Nom|Definite=Def|Gender=Com|Number=Plur 12 nsubj _ NER=O|SpaceAfter=No
14 , , PUNCT _ _ 12 punct _ NER=O
15 herr herr NOUN _ Case=Nom|Definite=Ind|Gender=Com|Number=Sing 16 det _ NER=O
16 talman talman NOUN _ Case=Nom|Definite=Ind|Gender=Com|Number=Sing 12 obj _ NER=O|SpaceAfter=No
17 , , PUNCT _ _ 12 punct _ NER=O
18 valt välja VERB _ VerbForm=Sup|Voice=Act 12 dep _ NER=O
19 att att PART _ _ 18 obj _ NER=O
20 yrka yrka VERB _ VerbForm=Inf|Voice=Act 19 punct _ NER=O
21 avslag avslag NOUN _ Case=Nom|Definite=Ind|Gender=Neut|Number=Sing 20 obj _ NER=O
22 på på ADP _ _ 20 advmod _ NER=O
23 propositionen proposition NOUN _ Case=Nom|Definite=Def|Gender=Com|Number=Sing 22 advmod _ NER=O|SpaceAfter=No
24 . . PUNCT _ _ 12 punct _ NER=O
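The check done by hand above (finding a token that is never the argument of any link) can be sketched in Python as follows. It assumes the `target` attribute always holds exactly two references in "head argument" order, as declared by `targFunc`. The function name `unlinked_tokens` and the short ids `s1`, `t1`, `t2`, `t3` are hypothetical and only keep the sample readable.

```python
import xml.etree.ElementTree as ET

TEI_NS = "{http://www.tei-c.org/ns/1.0}"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def unlinked_tokens(sentence_xml):
    """Return ids of <w>/<pc> tokens that never occur as the argument of a <link>."""
    s = ET.fromstring(sentence_xml)
    token_ids = [
        el.get(XML_ID)
        for el in s.iter()
        if el.tag in (TEI_NS + "w", TEI_NS + "pc")
    ]
    dependents = set()
    for link in s.iter(TEI_NS + "link"):
        # Assumption: target is "#head #argument", matching targFunc="head argument".
        _head, argument = link.get("target").split()
        dependents.add(argument.lstrip("#"))
    return [tid for tid in token_ids if tid not in dependents]


# Hypothetical, shortened version of the sentence discussed above.
sample = """<s xmlns="http://www.tei-c.org/ns/1.0" xml:id="s1">
  <w xml:id="t1">De</w>
  <w xml:id="t2">skapar</w>
  <w xml:id="t3">har</w>
  <linkGrp targFunc="head argument" type="UD-SYN">
    <link ana="ud-syn:nsubj" target="#t3 #t1"/>
    <link ana="ud-syn:root" target="#s1 #t3"/>
  </linkGrp>
</s>"""
print(unlinked_tokens(sample))
```

A token reported by this function ends up with empty HEAD and DEPREL columns in the CoNLL-U conversion, which is what the validator complains about.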
@ninpnin BTW, you can now see that using semantically senseless IDs makes the debugging more complicated.
@matyaskopp Nahh, it's pretty easy to grep those things. Ofc it's a matter of taste. The problem for me is that I haven't fixed the seed, and thus the IDs change every time I regenerate the corpus.
BTW I found the issue: about 20 words in the whole corpus were incorrectly tagged as abbreviations, which screwed up the sentence. Everything passes locally now, I expect to finally finish this thing today.
@TomazErjavec Here's the updated full corpus https://github.com/ninpnin/ParlaMint/releases/tag/v2.1.3
Thanks, the log is at https://nl.ijs.si/et/tmp/ParlaMint/Repo/ParlaMint-SE.log. We still have the no. of words and the covid date (which I fix, so OK), the parties without a name (which you say are missing from the source) and short dates (which I thought you fixed). Still, these are just warnings, so OK. Importantly, the CoNLL-U looks good!
So, unless @matyaskopp protests, I think you are good for 3.0.
So, unless @matyaskopp protests, I think you are good for 3.0.
No protest
@TomazErjavec, I checked the ParlaMint-SE-log, and the end of the file is strange, but I guess it does not cause any trouble:
make jvert-one
make[2]: Entering directory '/home/project/corpora/Parla/ParlaMint/V3'
/project/corpora/Parla/ParlaMint/ParlaMint/Scripts/join-verts.pl -codes HU -in Master -out Verts
INFO: ***Joining HU
find: ‘/home/project/corpora/Parla/ParlaMint/V3/Master/ParlaMint-HU.vert’: No such file or directory
cp: cannot stat '/home/project/corpora/Parla/ParlaMint/V3/Master/ParlaMint-HU.vert/*_hu.regi': No such file or directory
make[2]: Leaving directory '/home/project/corpora/Parla/ParlaMint/V3'
make pack-one
make[2]: Entering directory '/home/project/corpora/Parla/ParlaMint/V3'
/project/corpora/Parla/ParlaMint/ParlaMint/Scripts/pack-parlamint.pl -codes 'HU' -in Master -out Transfer
INFO: ***Packing HU
INFO: *Packing ParlaMint-HU.TEI, ParlaMint-HU.txt
INFO: *Packing ParlaMint-HU.TEI.ana, ParlaMint-HU.conllu, ParlaMint-HU.vert
WARN: No ana root file, skipping
rsync -av Transfer/ParlaMint-HU.* tomaz@nl.ijs.si:/home/tomaz/www/tmp/ParlaMint/Repo
sending incremental file list
ParlaMint-HU.tgz
sent 49,941 bytes received 87,238 bytes 39,194.00 bytes/sec
total size is 155,053,949 speedup is 1,130.30
make[2]: Leaving directory '/home/project/corpora/Parla/ParlaMint/V3'
rsync -av ParlaMint-SE.log tomaz@nl.ijs.si:/home/tomaz/www/tmp/ParlaMint/Repo
sending incremental file list
No protest
Great, @ninpnin, feel free to close.
@TomazErjavec, I checked the ParlaMint-SE-log, and the end of the file is strange, but I guess it does not cause any trouble:
Yes, I know. I started running HU before SE was finished but, indeed, no harm done.
@TomazErjavec it seems I can't close issues here
closing
component filenames
ParlaMint-SE_YYYY-MM-DD<suffix without '_'>.xml
/TEI/@xml:id
Can you please rename the component files according to the recommendations: 2.3. File names and directory structure
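A rough Python sketch of a check for the pattern quoted above. The regular expression is only my reading of `ParlaMint-SE_YYYY-MM-DD<suffix without '_'>.xml`, not a rule taken from the recommendations themselves. The function name `component_name_ok` and the second file name in the example are hypothetical.

```python
import re


def component_name_ok(filename, country="SE"):
    """Check a component file name against ParlaMint-XX_YYYY-MM-DD<suffix>.xml.

    Assumption: the suffix may be empty and must not contain an underscore.
    """
    pattern = rf"ParlaMint-{re.escape(country)}_\d{{4}}-\d{{2}}-\d{{2}}[^_]*\.xml"
    return re.fullmatch(pattern, filename) is not None


# A current SE file name from this thread, then a hypothetical renamed one.
print(component_name_ok("ParlaMint-SE_201516--19.xml"))
print(component_name_ok("ParlaMint-SE_2015-10-20-prot-19.xml"))
```

The sketch only tests the shape of the name; it does not verify that the date part is a real calendar date.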
wrong meeting text content
teiCorpus//meeting/text()
https://github.com/ninpnin/ParlaMint/blob/6b36bd4952aecde322bbd4c52ad4eed7a73e6ec3/Data/ParlaMint-SE/ParlaMint-SE.xml#L8-L10
missing Swedish translations in taxonomies
remove unused taxonomies
taxonomy[@xml:id="parla.links"]
I guess you can remove this taxonomy, it was used in CZ corpus and it seems that you don't use it. https://github.com/ninpnin/ParlaMint/blob/6b36bd4952aecde322bbd4c52ad4eed7a73e6ec3/Data/ParlaMint-SE/ParlaMint-SE.xml#L265-L279
wrong date in corpus root setting
setting/date element value
setting/date/@ana
setting/date should contain the timespan of the corpus (from-to), and if you want to add an @ana attribute, it should contain a list of terms: https://github.com/ninpnin/ParlaMint/blob/6b36bd4952aecde322bbd4c52ad4eed7a73e6ec3/Data/ParlaMint-SE/ParlaMint-SE.xml#L316
Weird event label
https://github.com/ninpnin/ParlaMint/blob/6b36bd4952aecde322bbd4c52ad4eed7a73e6ec3/Data/ParlaMint-SE/ParlaMint-SE.xml#L324-L333
invalid date in parliament organization
from should start before to.
Thanks for this bug. It seems that our validation is not paranoid enough. (@matyaskopp, extend validation)
https://github.com/ninpnin/ParlaMint/blob/6b36bd4952aecde322bbd4c52ad4eed7a73e6ec3/Data/ParlaMint-SE/ParlaMint-SE.xml#L328
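The extended validation could include something like the following Python sketch. It assumes both attributes hold full ISO dates (YYYY-MM-DD) and that equal values are acceptable for a one-day span. The function name `span_is_ordered` and the dates are invented for illustration; they are not taken from the SE corpus.

```python
from datetime import date


def span_is_ordered(from_value, to_value):
    """Return True when the @from date is not later than the @to date.

    Assumption: both values are full ISO dates such as '2015-09-15'.
    """
    return date.fromisoformat(from_value) <= date.fromisoformat(to_value)


# Hypothetical values: a well-ordered span and a reversed one.
print(span_is_ordered("2015-09-15", "2022-05-25"))
print(span_is_ordered("2022-05-25", "2015-09-15"))
```

Shorter values such as a bare year would need their own handling, since `date.fromisoformat` rejects them.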
missing term in parliament organization
There should be three terms in parliament organization. Expecting it owing to:
missing opposition relation
Do you have opposition in the Swedish parliament?
split forename
if someone has multiple names, each should have its own element https://github.com/ninpnin/ParlaMint/blob/6b36bd4952aecde322bbd4c52ad4eed7a73e6ec3/Data/ParlaMint-SE/ParlaMint-SE.ana.xml#L602
should be
component file meeting
The meeting element in the component file should specify the content of the file (e.g. use parla.sitting if it contains a sitting day): https://github.com/ninpnin/ParlaMint/blob/6b36bd4952aecde322bbd4c52ad4eed7a73e6ec3/Data/ParlaMint-SE/ParlaMint-SE_201516--19.xml#L8
CZ sample: https://github.com/clarin-eric/ParlaMint/blob/47a6a842d5a6447266f3ce0d95ad83bdac66673e/Data/ParlaMint-CZ/ParlaMint-CZ_2016-04-13-ps2013-044-02-013-114.xml#L13-L16
debates beginning
It is possible that I don't understand it. Sittings in your data start with a weird sequence of unknown speakers and notes. @TomazErjavec can you help me with the feedback here? https://github.com/ninpnin/ParlaMint/blob/6b36bd4952aecde322bbd4c52ad4eed7a73e6ec3/Data/ParlaMint-SE/ParlaMint-SE_202122--29.xml#L193
Some notes look similar to some notes...
and even the linguistic annotation is weird for these situations:
missing chairperson
chair role.
speeches split by paragraphs
You are starting a new utterance whenever a new paragraph starts. There is no speaker change... https://github.com/ninpnin/ParlaMint/blob/6b36bd4952aecde322bbd4c52ad4eed7a73e6ec3/Data/ParlaMint-SE/ParlaMint-SE_201516--19.xml#L140-L149
I don't understand the usage of @next (referring to the following speech - not u) and prev (referring to the first u element of a sequence of u elements that creates one speech)