Open matyaskopp opened 1 year ago
Converted notes into more specific elements within segments with commit cc386af.
You have removed the spaces around notes, which can cause trouble in tokenization: a note can end up inside a token (= unexpected behaviour of my annotation script). https://github.com/romanian-parlamint/ParlaMint/blob/cc386afc90e1298cb4f4d79f44d5558949e4eeae/Data/ParlaMint-RO/ParlaMint-RO_2000-04-14-id4927.xml#L472
<seg xml:id="ParlaMint-RO_2000-04-14-id4927.u39.seg6">Cine este pentru?<vocal type="shouting"><desc>(Vociferără în partea dreaptă a sălii).</desc></vocal>Vă <!-- ... --> confuzie.</seg>
Should be:
<seg xml:id="ParlaMint-RO_2000-04-14-id4927.u39.seg6">Cine este pentru? <vocal type="shouting">
<desc>(Vociferără în partea dreaptă a sălii).</desc>
</vocal> Vă <!-- ... --> confuzie.</seg>
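A minimal sketch of how such glued notes could be detected before annotation. The element list in `INLINE` and the function name `glued_notes` are assumptions for illustration, not part of the ParlaMint tooling; self-closing elements are not covered.

```python
import re

# Inline elements that should be separated from neighbouring text by whitespace
# (assumed list, based on the elements named in the validation errors).
INLINE = r"(?:vocal|kinesic|incident|gap|note)"

def glued_notes(xml_text: str) -> list[str]:
    """Return snippets where an inline note element touches text directly."""
    hits = []
    # A non-space character (other than '>') immediately before the opening tag.
    hits += re.findall(rf"[^\s>]<{INLINE}\b", xml_text)
    # A closing tag immediately followed by a non-space character (other than '<').
    hits += re.findall(rf"</{INLINE}>[^\s<]", xml_text)
    return hits
```

Running it over each component file and reporting non-empty results would flag segments like the one above.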
wrong language context - English content in xml:lang="ro"
Can you please provide an example?
I ran
find -type f -name *.xml -exec grep --color=auto -i -nH --null -e lang\=\"ro\" \{\} +
went over all results, and wasn't able to find English content. Maybe I'm missing something?
Oh, sorry - your <teiCorpus> is in English context:
<teiCorpus xmlns="http://www.tei-c.org/ns/1.0" xml:lang="en" xml:id="ParlaMint-RO">
This is the only corpus that has it. I implicitly expected that it has xml:lang="ro"
To search the language context of <term>, I now used:
java -cp /usr/share/java/saxon.jar net.sf.saxon.Query -xi:off \!method=adaptive -qs:'//*[name()="term" and ./ancestor::*[@xml:lang][1]/@xml:lang="ro"]' -s:ParlaMint-RO/ParlaMint-RO.xml
<term xmlns="http://www.tei-c.org/ns/1.0">Legislatură</term>
<term xmlns="http://www.tei-c.org/ns/1.0">Unități geo-politice sau administrative</term>
<term xmlns="http://www.tei-c.org/ns/1.0">Legislatură națională</term>
<term xmlns="http://www.tei-c.org/ns/1.0">Organizație politică</term>
<term xmlns="http://www.tei-c.org/ns/1.0">Camere</term>
<term xmlns="http://www.tei-c.org/ns/1.0">Parlament bicameral</term>
<term xmlns="http://www.tei-c.org/ns/1.0">Senat</term>
<term xmlns="http://www.tei-c.org/ns/1.0">Camera deputaților</term>
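The same lookup can be sketched in Python, for readers without Saxon at hand. This is an illustration only: the function name `effective_langs` is hypothetical, and it resolves the language on the ancestor-or-self axis (an element's own xml:lang wins), which is an assumption about how the convention is meant to be read.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
TEI = "{http://www.tei-c.org/ns/1.0}"

def effective_langs(root, tag):
    """Yield (element, language) for every TEI `tag` element under `root`.

    The language is the nearest xml:lang found on the element itself
    or on its ancestors; None if no xml:lang is set anywhere above it.
    """
    def walk(elem, inherited):
        lang = elem.get(XML_LANG, inherited)
        if elem.tag == TEI + tag:
            yield elem, lang
        for child in elem:
            yield from walk(child, lang)
    yield from walk(root, None)
```

Filtering the result for a given language code lists the terms that inherit that language context.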
The majority language in teiCorpus is usually English, so you have it correctly according to the documentation:
@xml:lang is also a global attribute and gives the language code of the text content of the element; for the corpus root this does not (just) mean the content of its TEI header, but primarily the textual content of its XIncluded components. The convention is that language of the text content of an element is determined by the value of the first @xml:lang attribute on its ancestor axis. In cases where the content is multilingual, the language code should be of the majority language. When the proportion of the languages is about equal, then the mul code for multiple languages can also be used.
but it is common to have the corpus language...
@TomazErjavec Can English be preserved in teiCorpus here?
Normalized the setting element in the corpus root file and component files, and set the corpus span, with commit d343920.
Should resolve: "setting element in root file" and "corpus timespan setting".
@TomazErjavec Can English be preserved in teiCorpus here?
In practice I'd much rather not have an exception. So, teiCorpus and TEI should have @xml:lang="ro".
But maybe teiHeader with @xml:lang="en" is legit?
Every person should have one record in listPerson: https://github.com/romanian-parlamint/ParlaMint/blob/548e3576054c9067aee43fb2275b879cac9ba806/Data/ParlaMint-RO/ParlaMint-RO.xml#L1306-L1324
<person xml:id="Augustin-Lucian-Bolcas">
<persName>
<forename>Lucian</forename>
<forename>Augustin</forename>
<surname>Bolcaș</surname>
</persName>
<sex value="M"/>
<affiliation ana="#RoParl.51" ref="#RoParl" role="member" from="2000-12-15" to="2004-11-30"/>
</person>
<person xml:id="Lucian-Augustin-Bolcas">
<persName>
<forename>Lucian</forename>
<forename>Augustin</forename>
<surname>Bolcaș</surname>
</persName>
<sex value="M"/>
<affiliation ana="#RoParl.51" ref="#RoParl" role="member" from="2000-12-15" to="2004-11-30"/>
<affiliation ana="#RoParl.52" ref="#RoParl" role="member" from="2004-12-19" to="2008-12-13"/>
</person>
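A rough sketch for spotting such duplicates: group person records by their set of forenames and surnames and report groups with more than one xml:id. The function name `duplicate_persons` is hypothetical, and matching on names alone is only a heuristic (two different people can share a name), so the output is a list of candidates to review.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

TEI = "{http://www.tei-c.org/ns/1.0}"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

def duplicate_persons(xml_text: str) -> list[list[str]]:
    """Return groups of person xml:ids that share the same name parts."""
    root = ET.fromstring(xml_text)
    by_name = defaultdict(list)
    for person in root.iter(TEI + "person"):
        pers_name = person.find(TEI + "persName")
        if pers_name is None:
            continue
        # Sort name parts so that forename order does not matter.
        forenames = tuple(sorted(f.text or "" for f in pers_name.findall(TEI + "forename")))
        surnames = tuple(sorted(s.text or "" for s in pers_name.findall(TEI + "surname")))
        by_name[(forenames, surnames)].append(person.get(XML_ID))
    return [ids for ids in by_name.values() if len(ids) > 1]
```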
"Necunoscut Necunoscut" used as a person's name; first occurrence: https://github.com/romanian-parlamint/ParlaMint/blob/548e3576054c9067aee43fb2275b879cac9ba806/Data/ParlaMint-RO/ParlaMint-RO.xml#L6030
<person xml:id="Dan-Dumitrescu">
<persName>
<forename>Necunoscut</forename>
<surname>Necunoscut</surname>
</persName>
<sex value="U"/>
<affiliation ana="#RoParl.55" ref="#RoParl" role="member" from="2016-12-21" to="2020-12-20"/>
</person>
Missing speech content
As suggested by @TomazErjavec, added <gap> elements to the utterances without segments in commit 0082dd3.
corpus timespan bibl
Included corpus timespan in <bibl> element with commit 70b7fc2.
corpus timespan: it would be nice to have it in the text content of the corpus title too
Included corpus span in corpus subtitle with commit df3879b.
presence list is missing status
As discussed in the meeting on April 12, we cannot provide the presence list in time for this version because this requires changes in the crawlers of the session transcripts. I will try to include this data into a future version of the corpus.
extend meeting elements (#parla.term, #parla.sitting)
Extended meeting elements with term and sitting information with commit 75affa9.
Error: /home/runner/work/ParlaMint/ParlaMint/ParlaMint/Data/ParlaMint-RO/ParlaMint-RO_2015-09-29-id7560.xml:132:189: error: text not allowed here; expected element "gap", "incident", "kinesic", "note", "pb", "s" or "vocal"
@RePierre, you include unannotated files (TEI) in annotated (TEI.ana) root file: https://github.com/romanian-parlamint/ParlaMint/blob/459b829a1e053df1e22502222324d246be1c9a47/Data/ParlaMint-RO/ParlaMint-RO.ana.xml#L3018-L3027 eg
<xsi:include xmlns:xsi="http://www.w3.org/2001/XInclude" href="ParlaMint-RO_2015-09-29-id7560.xml"/>
should be
<xsi:include xmlns:xsi="http://www.w3.org/2001/XInclude" href="ParlaMint-RO_2015-09-29-id7560.ana.xml"/>
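A small check along these lines could catch the problem before validation. It is a sketch under assumptions: the function name is hypothetical, and component files are recognised only by a `_YYYY-MM-DD` date in the href, so that other included files (taxonomies, person lists) are ignored.

```python
import re
import xml.etree.ElementTree as ET

XI_INCLUDE = "{http://www.w3.org/2001/XInclude}include"
# Assumption: component files carry a date such as _2015-09-29 in their name.
COMPONENT_RE = re.compile(r"_\d{4}-\d{2}-\d{2}")

def unannotated_components(ana_root_xml: str) -> list[str]:
    """Return hrefs of component files in a .ana root that are not .ana.xml files."""
    root = ET.fromstring(ana_root_xml)
    bad = []
    for inc in root.iter(XI_INCLUDE):
        href = inc.get("href", "")
        if COMPONENT_RE.search(href) and not href.endswith(".ana.xml"):
            bad.append(href)
    return bad
```

The XInclude elements are only read as plain elements here; nothing is resolved or loaded.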
include annotated component files
Included proper component files in commit 90da93b.
@RePierre, thanks for the progress.
I have spotted an issue in the TEI.ana version of the files:
Data/ParlaMint-RO/ParlaMint-RO_2015-09-29-id7560.ana.xml:6433:284: error: text not allowed here; expected the element end-tag or element "gap", "incident", "kinesic", "note", "pb", "s" or "vocal"
<seg xml:id="ParlaMint-RO_2015-09-29-id7560.u11.seg8">Cred <!--
...
--> salariile. <vocal type="noise"><desc>(Aplauze.)</desc></vocal> Însă<!--
...
--> toţi.</seg>
TEI.ana:
<seg xml:id="ParlaMint-RO_2015-09-29-id7560.u11.seg8"><vocal type="noise"><desc>(Aplauze.)</desc></vocal> Însă<!--
...
-->toţi.<s xml:id="ParlaMint-RO_2015-09-29-id7560.u11.seg8.1">
<w xml:id="ParlaMint-RO_2015-09-29-id7560.u11.seg8.1.1" lemma="Cred" pos="Vmip1s" msd="UPosTag=AUX|Mood=Ind|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin">Cred</w>
<!--... -->
</s>
<!--... -->
</seg>
<seg xml:id="ParlaMint-RO_2006-09-18-id6154.u33.seg8">Mulţumesc.</seg>
<seg xml:id="ParlaMint-RO_2006-09-18-id6154.u33.seg9">(Domnul Valeriu Ştefan Zgonea se îndreaptă spre prezidiu.)</seg>
</u>
should be:
<seg xml:id="ParlaMint-RO_2006-09-18-id6154.u33.seg8">Mulţumesc.</seg>
</u>
<note type="narrative">(Domnul Valeriu Ştefan Zgonea se îndreaptă spre prezidiu.)</note>
Other occurrences in sample data:
DataForks/ParlaMint-RO/ParlaMint-RO_2006-09-18-id6154.xml:411: <seg xml:id="ParlaMint-RO_2006-09-18-id6154.u32.seg4">(Domnul Valeriu Ştefan Zgonea părăseşte prezidiul şi se îndreaptă spre tribună.)</seg>
DataForks/ParlaMint-RO/ParlaMint-RO_2006-09-18-id6154.xml:423: <seg xml:id="ParlaMint-RO_2006-09-18-id6154.u33.seg9">(Domnul Valeriu Ştefan Zgonea se îndreaptă spre prezidiu.)</seg>
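To list further candidates beyond the sample data, a regular expression over the raw files may be enough. The sketch below assumes that a segment whose entire content is one parenthesised phrase is a narrative note rather than speech; the pattern and the function name are illustrative only.

```python
import re

# A <seg> whose whole content is a single parenthesised phrase without markup.
PAREN_SEG = re.compile(r"<seg\b[^>]*>\s*(\([^()<]*\))\s*</seg>")

def narrative_seg_candidates(xml_text: str) -> list[str]:
    """Return the text of segments that look like narrative notes."""
    return PAREN_SEG.findall(xml_text)
```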
This character is allowed in ParlaMint, but it causes problems in linguistic annotations, I suggest removing it from the text: https://github.com/romanian-parlamint/ParlaMint/blob/a510c149ba04407fe6df77414b3a2aaec6f47022/Data/ParlaMint-RO/ParlaMint-RO_2000-10-24-id4980.xml#L148
<seg xml:id="ParlaMint-RO_2000-10-24-id4980.u2.seg5">După <!--
...
--> urgie 1940. Dar n-a fost să fie aşa.</seg>
<w xml:id="ParlaMint-RO_2000-10-24-id4980.u2.seg5.1.29" lemma="" pos="Ncm--n" msd="UPosTag=NOUN|Definite=Ind|Gender=Masc"></w>
I guess you are using a model that labels not only named entities from PER/LOC/ORG/MISC set but also DATE and probably other labels. Something like this: https://huggingface.co/dumitrescustefan/bert-base-romanian-ner And you map all non-proper names to the MISC category, eg
<name type="MISC">
<w xml:id="ParlaMint-RO_2000-10-24-id4980.u2.seg5.1.23" lemma="acel" pos="Dd3msr---e" msd="UPosTag=DET|Case=Acc,Nom|Gender=Masc|Number=Sing|Person=3|Position=Prenom|PronType=Dem">acel</w>
<w xml:id="ParlaMint-RO_2000-10-24-id4980.u2.seg5.1.24" lemma="an" pos="Ncms-n" msd="UPosTag=NOUN|Definite=Ind|Gender=Masc|Number=Sing">an</w>
</name>
or
<name type="MISC">
<w xml:id="ParlaMint-RO_2000-10-24-id4980.u2.seg5.1.30" lemma="1940" pos="Mc-s-d" msd="UPosTag=">1940</w>
</name>
The year 1940 is not a proper name, so it shouldn't be surrounded by <name>. It is better to use <date>.
There are two options to solve this
We are under time pressure, so I suggest using option (1) for ParlaMint3.0, and you can possibly improve it in ParlaMint3.1 (create RO special taxonomy, use proper elements and add ana attribute)
@TomazErjavec ??
In this paragraph (ParlaMint-RO_2000-10-24-id4980.u2.seg8.2), NEs seem to be shifted.
https://raw.githubusercontent.com/clarin-eric/ParlaMint/3f2d0a820d31aa7e55b72156089a3450b303e3bc/Data/ParlaMint-RO/ParlaMint-RO_2000-10-24-id4980.ana.xml
Reformatted, with the token elements (w and pc) removed:
<s xml:id="ParlaMint-RO_2000-10-24-id4980.u2.seg8.2">
atitudinea autorităţilor ucrainene faţă de delegaţiile judeţului Suceava şi
<name type="MISC">Botoşani</name>
, la festivitatea dezvelirii
<name type="LOC">statuii</name>
lui
<name type="LOC">Eminescu</name>
, la Cernăuţi, în ziua de 15 iunie
<name type="LOC">2000</name>
; constrângerile
<name type="MISC">aduse în şcolile româneşti;</name>
coborârea unicului steag românesc de
<name type="MISC">pe</name>
clădirea sediului
<name type="LOC">redacţiei ziarului"</name>
Lumea"
<name type="MISC">;</name>
prezenţa la
<name type="MISC">manifestările româneşti a unor</name>
reprezentanţi gălăgioşi ai organizaţiilor
<name type="MISC">extremiste</name>
ucrainene; oprirea tinerilor etnici români,
<name type="MISC">în</name>
număr de
<name type="PER">200, de</name>
a veni la studii
<name type="MISC">în</name>
România, cu burse din partea statului
<name type="LOC">român</name>
şi altele.
</s>
<note type="speaker">Domnul Vasile Lupu:</note>
<u ana="#chair" who="#Vasile-Lupu" xml:id="ParlaMint-RO_2000-10-24-id4980.u37">
<seg xml:id="ParlaMint-RO_2000-10-24-id4980.u37.seg1">Să vedem cine îl face. <vocal type="murmuring"><desc>(Rumoare în partea stângă a sălii)</desc></vocal> </seg>
<seg xml:id="ParlaMint-RO_2000-10-24-id4980.u37.seg2">Dar, iată, se pare că nu s-a terminat şedinţa Biroului permanent.</seg>
<seg xml:id="ParlaMint-RO_2000-10-24-id4980.u37.seg3">Voci din sală:</seg>
<seg xml:id="ParlaMint-RO_2000-10-24-id4980.u37.seg4">S-a terminat de mult!</seg>
</u>
should be:
<note type="speaker">Domnul Vasile Lupu:</note>
<u ana="#chair" who="#Vasile-Lupu" xml:id="ParlaMint-RO_2000-10-24-id4980.u37">
<seg xml:id="ParlaMint-RO_2000-10-24-id4980.u37.seg1">Să vedem cine îl face. <vocal type="murmuring"><desc>(Rumoare în partea stângă a sălii)</desc></vocal> </seg>
<seg xml:id="ParlaMint-RO_2000-10-24-id4980.u37.seg2">Dar, iată, se pare că nu s-a terminat şedinţa Biroului permanent.</seg>
</u>
<note type="speaker">Voci din sală:</note>
<!-- no who attribute, ana is regular - expecting MP interrupting -->
<u ana="#regular" xml:id="ParlaMint-RO_2000-10-24-id4980.u38">
<seg xml:id="ParlaMint-RO_2000-10-24-id4980.u38.seg1">S-a terminat de mult!</seg>
</u>
<orgName xml:lang="en" full="yes">Placeholder parliamentary group</orgName>
I guess you are aware of this. I just wanted it to be recorded
INFO[10] Total number of affiliations with RoParl: 256
INFO[10] Total number of affiliations with RoGov: 0
Error: ERROR[10] government-role organisation without affiliation: #RoGov
INFO[10] Total number of affiliations with RoParl.All: 0
WARN[10] parliamentaryGroup-role organisation without affiliation: #RoParl.All
INFO[12] Total number of organizations with parliament role: 1
INFO[12] Total number of organizations with government role: 1
INFO[12] Total number of organizations with parliamentaryGroup role: 1
INFO[??] Total number of affiliations 256
INFO[??] Total number of NO-role affiliations 0
INFO[??] Total number of 'member' role affiliations 256
wrongly placed notes in the TEI.ana version
Fixed with commit 6662ec4.
_ when Mc-s-d
Every token with pos="Mc-s-d" has the wrong msd="UPosTag=_".
sample:
<w xml:id="ParlaMint-RO_2000-10-24-id4980.u2.seg5.1.2"
lemma="1990"
pos="Mc-s-d"
msd="UPosTag=_">1990</w>
You can fix this with msd="UPosTag=NUM" or msd="UPosTag=NUM|NumForm=Digit"
<w xml:id="ParlaMint-RO_2000-10-24-id4980.u2.seg5.1.2"
lemma="1990"
pos="Mc-s-d"
msd="UPosTag=NUM|NumForm=Digit">1990</w>
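The suggested fix could be applied in bulk with a text-level rewrite such as the sketch below. It is not the annotation pipeline's own code; the function name `fix_digit_msd` is hypothetical, and it only touches `<w>` start tags that carry pos="Mc-s-d", leaving every other token (including Mc-s-b) unchanged.

```python
import re

# Matches a complete <w ...> start tag, including tags spread over several lines.
W_TAG = re.compile(r"<w\b[^>]*>")

def fix_digit_msd(xml_text: str) -> str:
    """Replace the empty UPosTag on digit numerals with NUM|NumForm=Digit."""
    def repl(match: re.Match) -> str:
        tag = match.group(0)
        if 'pos="Mc-s-d"' in tag:
            tag = tag.replace('msd="UPosTag=_"', 'msd="UPosTag=NUM|NumForm=Digit"')
        return tag
    return W_TAG.sub(repl, xml_text)
```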
_ when Mc-s-b
Here I suggest replacing _ with X
cat DataForks/ParlaMint-RO/ParlaMint-RO_*.ana.xml| grep 'UPosTag=_"' | grep -v 'pos="Mc.s.d"'
<w xml:id="ParlaMint-RO_2006-09-18-id6154.u31.seg3.1.73" lemma="29,4" pos="Mc-s-b" msd="UPosTag=_">29,4</w>
<w xml:id="ParlaMint-RO_2006-09-18-id6154.u31.seg7.1.14" lemma="29,4" pos="Mc-s-b" msd="UPosTag=_">29,4</w>
<w xml:id="ParlaMint-RO_2006-09-18-id6154.u76.seg2.1.1" lemma="Mie" pos="Mc-s-b" msd="UPosTag=_">Mie</w>
<w xml:id="ParlaMint-RO_2006-09-18-id6154.u136.seg18.1.2" lemma="31.III.2006" pos="Mc-s-b" msd="UPosTag=_">31.III.2006</w>
<w xml:id="ParlaMint-RO_2006-09-18-id6154.u153.seg5.1.52" lemma="Secuiesc" pos="Mc-s-b" msd="UPosTag=_">Secuiesc</w>
<w xml:id="ParlaMint-RO_2015-09-29-id7560.u60.seg7.1.18" lemma="207;voturi" pos="Mc-s-b" msd="UPosTag=_">207;voturi</w>
<w xml:id="ParlaMint-RO_2015-10-12-id7569.u48.seg9.1.12" lemma="2003/88" pos="Mc-s-b" msd="UPosTag=_">2003/88</w>
<w xml:id="ParlaMint-RO_2015-10-12-id7569.u96.seg2.2.15" lemma="2002/772" pos="Mc-s-b" msd="UPosTag=_">2002/772</w>
<w xml:id="ParlaMint-RO_2015-10-12-id7569.u156.seg16.1.25" lemma="2007-2013" pos="Mc-s-b" msd="UPosTag=_">2007-2013</w>
<w xml:id="ParlaMint-RO_2018-03-05-id7900.u7.seg11.1.1" lemma="Mie" pos="Mc-s-b" msd="UPosTag=_">Mie</w>
<w xml:id="ParlaMint-RO_2018-03-05-id7900.u45.seg8.1.1" lemma="Mie" pos="Mc-s-b" msd="UPosTag=_">Mie</w>
<w xml:id="ParlaMint-RO_2021-10-25-id8335.u70.seg2.1.34" lemma="30.06.2021" pos="Mc-s-b" msd="UPosTag=_">30.06.2021</w>
<w xml:id="ParlaMint-RO_2021-10-25-id8335.u91.seg2.1.36" lemma="29A" pos="Mc-s-b" msd="UPosTag=_">29A</w>
<w xml:id="ParlaMint-RO_2021-10-25-id8335.u92.seg2.1.40" lemma="29A" pos="Mc-s-b" msd="UPosTag=_">29A</w>
<w xml:id="ParlaMint-RO_2021-10-25-id8335.u92.seg3.1.7" lemma="29A" pos="Mc-s-b" msd="UPosTag=_">29A</w>
<w xml:id="ParlaMint-RO_2021-10-25-id8335.u92.seg6.1.6" lemma="29A" pos="Mc-s-b" msd="UPosTag=_">29A</w>
<w xml:id="ParlaMint-RO_2021-10-25-id8335.u92.seg6.1.47" lemma="29A" pos="Mc-s-b" msd="UPosTag=_">29A</w>
<w xml:id="ParlaMint-RO_2021-10-25-id8335.u92.seg12.1.7" lemma="29A" pos="Mc-s-b" msd="UPosTag=_">29A</w>
<w xml:id="ParlaMint-RO_2021-10-25-id8335.u118.seg6.1.41" lemma="27.548" pos="Mc-s-b" msd="UPosTag=_">27.548</w>
<w xml:id="ParlaMint-RO_2021-10-25-id8335.u126.seg4.1.30" lemma="1.579/2006" pos="Mc-s-b" msd="UPosTag=_">1.579/2006</w>
<w xml:id="ParlaMint-RO_2021-11-09-id8341.u96.seg3.2.49" lemma="1,5°C" pos="Mc-s-b" msd="UPosTag=_">1,5°C</w>
join attribute
join="right" is missing in TEI.ana; see documentation: https://clarin-eric.github.io/ParlaMint/#sec-ana-words
As RO won't be a part of 3.1, moving this to "future" milestone.
meeting element (#parla.term, #parla.sitting)
I haven't found any information about terms or sittings in the meeting elements. This is how other corpora implement it: https://github.com/clarin-eric/ParlaMint/blob/197e5ecf057a5ed53db6375421d78ffaf4e1c45c/Data/ParlaMint-UA/ParlaMint-UA_2014-12-02-m0.xml#L11-L13
I was not able to find term info on Romanian parliament websites - I believe the information is there. And if a single file contains one sitting, then add sitting identification.
Missing speech content
In some files there is no speech content: https://github.com/romanian-parlamint/ParlaMint/blob/8439dd75ca3c31b89f06bac23eff736a72a6ed6a/Data/ParlaMint-RO/ParlaMint-RO_2000-09-04-id4959.xml
but the source contains speech contents: https://www.cdep.ro/pls/steno/steno2015.stenograma?ids=4959&idl=1#S0
Chairman note type: narrative or president
According to the documentation, narrative or president fits better in this case: https://github.com/romanian-parlamint/ParlaMint/blob/8439dd75ca3c31b89f06bac23eff736a72a6ed6a/Data/ParlaMint-RO/ParlaMint-RO_2000-09-04-id4959.xml#L125
not recognized notes
Notes are in italics in the source, so they are easy to recognize...
https://github.com/romanian-parlamint/ParlaMint/blob/8439dd75ca3c31b89f06bac23eff736a72a6ed6a/Data/ParlaMint-RO/ParlaMint-RO_2000-04-14-id4927.xml#L474
should be: (https://clarin-eric.github.io/ParlaMint/#TEI.vocal)
presence list
https://github.com/romanian-parlamint/ParlaMint/blob/8439dd75ca3c31b89f06bac23eff736a72a6ed6a/Data/ParlaMint-RO/ParlaMint-RO_2000-04-14-id4927.xml#L510-L513
corpus timespan (bibl, setting)
https://github.com/romanian-parlamint/ParlaMint/blob/8439dd75ca3c31b89f06bac23eff736a72a6ed6a/Data/ParlaMint-RO/ParlaMint-RO.xml#L72
https://github.com/romanian-parlamint/ParlaMint/blob/8439dd75ca3c31b89f06bac23eff736a72a6ed6a/Data/ParlaMint-RO/ParlaMint-RO.xml#L252
setting element
The root file setting element should correspond to the component ones (missing country): https://github.com/romanian-parlamint/ParlaMint/blob/8439dd75ca3c31b89f06bac23eff736a72a6ed6a/Data/ParlaMint-RO/ParlaMint-RO.xml#L249-L253
vs: https://github.com/romanian-parlamint/ParlaMint/blob/8439dd75ca3c31b89f06bac23eff736a72a6ed6a/Data/ParlaMint-RO/ParlaMint-RO_2000-04-14-id4927.xml#L97-L101
capitalize surname
[x] don't capitalize surname
https://github.com/romanian-parlamint/ParlaMint/blob/8439dd75ca3c31b89f06bac23eff736a72a6ed6a/Data/ParlaMint-RO/ParlaMint-RO.xml#L384
should be
sort component files
The component files should be ordered according to the contents' date.
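One way to get that order, sketched under the assumption that every component file name carries its date as `_YYYY-MM-DD` (as in ParlaMint-RO_2000-04-14-id4927.xml). The function name is hypothetical; files without a recognisable date sort first, and ties are broken by file name.

```python
import re

DATE_RE = re.compile(r"_(\d{4}-\d{2}-\d{2})")

def sort_components(hrefs: list[str]) -> list[str]:
    """Sort component file names by the ISO date embedded in the name."""
    def key(href: str) -> tuple[str, str]:
        match = DATE_RE.search(href)
        # ISO dates sort correctly as plain strings.
        return (match.group(1) if match else "", href)
    return sorted(hrefs, key=key)
```

The sorted list can then be used to emit the XInclude elements of the root file in date order.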
taxonomies
xml:lang="ro"