RO Feedback - Githubissues

matyaskopp commented 1 year ago

meeting element

[x] extend meeting elements (#parla.term, #parla.sitting)

I haven't found any information about terms or sitting in the meeting elements. This is how other corpora implement it: https://github.com/clarin-eric/ParlaMint/blob/197e5ecf057a5ed53db6375421d78ffaf4e1c45c/Data/ParlaMint-UA/ParlaMint-UA_2014-12-02-m0.xml#L11-L13

I was not able to find term information on the Romanian parliament websites, though I believe it is there. If a single file contains one sitting, please add the sitting identification as well.

Missing speech content

[ ] speech content

In some files there is no speech content: https://github.com/romanian-parlamint/ParlaMint/blob/8439dd75ca3c31b89f06bac23eff736a72a6ed6a/Data/ParlaMint-RO/ParlaMint-RO_2000-09-04-id4959.xml

        <note type="time">Şedinţa a început la ora 15,55.</note>
        <note type="chairman">Lucrările au fost conduse de domnul Ion Diaconescu, preşedintele Camerei Deputaţilor, asistat de domnii Andrei Ioan Chiliman şi Acsinte Gaspar, secretari.</note>
        <note type="speaker">Domnul Ion Diaconescu:</note>
        <u ana="#chair" who="#Ion-Diaconescu" xml:id="ParlaMint-RO_2000-09-04-id4959.u1"/>
        <note type="speaker">Domnul Iuliu Ioan Furo:</note>

but the source contains speech contents: https://www.cdep.ro/pls/steno/steno2015.stenograma?ids=4959&idl=1#S0
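A quick way to list the affected utterances could look like the sketch below. It only assumes that an utterance without content is a `<u>` with no child elements and no text, as in the example above; the function name `empty_utterances` is hypothetical.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

def empty_utterances(root):
    """Return the xml:id of every <u> that has neither child elements nor text."""
    ids = []
    for el in root.iter():
        is_u = el.tag.rsplit("}", 1)[-1] == "u"  # compare the local name, ignoring any namespace
        if is_u and len(el) == 0 and not (el.text or "").strip():
            ids.append(el.get(XML_ID))
    return ids
```

Running it over each component file (`ET.parse(path).getroot()`) would give the list of utterances to compare with the source pages.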

Chairman note type

[x] use narrative or president

According to the documentation, narrative or president fits better in this case: https://github.com/romanian-parlamint/ParlaMint/blob/8439dd75ca3c31b89f06bac23eff736a72a6ed6a/Data/ParlaMint-RO/ParlaMint-RO_2000-09-04-id4959.xml#L125
```
    <note type="chairman">Lucrările au fost conduse de domnul Ion Diaconescu, preşedintele Camerei Deputaţilor, asistat de domnii Andrei Ioan Chiliman şi Acsinte Gaspar, secretari.</note>
```

not recognized notes

[x] notes in text

Notes are set in italics in the source, so they are easy to recognize.

https://github.com/romanian-parlamint/ParlaMint/blob/8439dd75ca3c31b89f06bac23eff736a72a6ed6a/Data/ParlaMint-RO/ParlaMint-RO_2000-04-14-id4927.xml#L474

<seg xml:id="ParlaMint-RO_2000-04-14-id4927.u39.seg6">Cine este pentru?(Vociferără în partea dreaptă a sălii).Vă rog să număraţi... Vă rog să ridicaţi mâna, cei care sunteţi pentru acest amendament, să repetăm numărătoarea. Este o confuzie.</seg>

should be: (https://clarin-eric.github.io/ParlaMint/#TEI.vocal)

<seg xml:id="ParlaMint-RO_2000-04-14-id4927.u39.seg6">Cine este pentru? <vocal type="shouting">
    <desc>(Vociferără în partea dreaptă a sălii)</desc>
  </vocal> Vă rog să număraţi... Vă rog să ridicaţi mâna, cei care sunteţi pentru acest amendament, să repetăm numărătoarea. Este o confuzie.</seg>
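The conversion above could be sketched roughly as follows. This is only an illustration under assumptions: it treats every parenthesised stretch as a note (the italics from the source are not available here), keeps a directly following full stop inside `<desc>`, and takes the `type` value as a parameter because it cannot be derived from the text. The name `mark_notes` is hypothetical.

```python
import re
from xml.sax.saxutils import escape

# A parenthesised stretch, optionally followed by a full stop, with any
# surrounding whitespace (assumption: notes never contain nested parentheses).
NOTE_RE = re.compile(r"\s*(\([^()]*\)\.?)\s*")

def mark_notes(text, note_type="shouting"):
    """Wrap each parenthesised note in <vocal><desc>...</desc></vocal>, padded with single spaces."""
    def repl(match):
        return f' <vocal type="{note_type}"><desc>{match.group(1)}</desc></vocal> '
    # Escape XML special characters first, then insert the markup.
    return NOTE_RE.sub(repl, escape(text)).strip()
```

Choosing between `<vocal>`, `<kinesic>` and `<incident>` (and the `type` value) would still need a lookup on the note text.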

presence list

[ ] presence list is missing status

https://github.com/romanian-parlamint/ParlaMint/blob/8439dd75ca3c31b89f06bac23eff736a72a6ed6a/Data/ParlaMint-RO/ParlaMint-RO_2000-04-14-id4927.xml#L510-L513

        <u ana="#regular" who="#Andrei-Ioan-Chiliman" xml:id="ParlaMint-RO_2000-04-14-id4927.u46">
          <seg xml:id="ParlaMint-RO_2000-04-14-id4927.u46.seg1">Achimescu Victor Ştefan</seg>
          <seg xml:id="ParlaMint-RO_2000-04-14-id4927.u46.seg2">Aferăriţei Constantin</seg>
          <seg xml:id="ParlaMint-RO_2000-04-14-id4927.u46.seg3">Afrăsinei Viorica</seg>

corpus timespan

[x] corpus timespan bibl
[x] corpus timespan setting
[x] corpus timespan it would be nice to have it in text content of corpus title too

https://github.com/romanian-parlamint/ParlaMint/blob/8439dd75ca3c31b89f06bac23eff736a72a6ed6a/Data/ParlaMint-RO/ParlaMint-RO.xml#L72

        <bibl>
          <title type="main" xml:lang="en">Meeting minutes of the Romanian Parliament</title>
          <title type="main" xml:lang="ro">Stenograme ale şedinţelor din Parlamentul României</title>
          <idno type="URI">http://www.parlament.ro/</idno>
          <date from="2000-02-01" to="2020-11-24">2000-02-01 - 2020-11-24</date>
        </bibl>

https://github.com/romanian-parlamint/ParlaMint/blob/8439dd75ca3c31b89f06bac23eff736a72a6ed6a/Data/ParlaMint-RO/ParlaMint-RO.xml#L252

        <setting>
          <name type="city">Bucharest</name>
          <name type="place">Palace of the Parliament</name>
          <date from="2000-02-01" to="2020-11-24"/>
        </setting>

setting element

[x] setting element in root file

The setting element in the root file should correspond to the ones in the component files (the country is missing).

https://github.com/romanian-parlamint/ParlaMint/blob/8439dd75ca3c31b89f06bac23eff736a72a6ed6a/Data/ParlaMint-RO/ParlaMint-RO.xml#L249-L253

        <setting>
          <name type="city">Bucharest</name>
          <name type="place">Palace of the Parliament</name>
          <date from="2000-02-01" to="2020-11-24"/>
        </setting>

vs: https://github.com/romanian-parlamint/ParlaMint/blob/8439dd75ca3c31b89f06bac23eff736a72a6ed6a/Data/ParlaMint-RO/ParlaMint-RO_2000-04-14-id4927.xml#L97-L101

        <setting>
          <name type="city">Bucharest</name>
          <name type="country" key="RO">Romania</name>
          <date when="2000-04-14" ana="#parla.sitting">14.04.2000</date>
        </setting>

capitalize surname

[x] don't capitalize surname

https://github.com/romanian-parlamint/ParlaMint/blob/8439dd75ca3c31b89f06bac23eff736a72a6ed6a/Data/ParlaMint-RO/ParlaMint-RO.xml#L384
```
          <surname>GORGHIU</surname>
```
should be
```
          <surname>Gorghiu</surname>
```
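A minimal sketch of such a normalisation, assuming that only fully upper-case surnames need to be changed (the helper name `normalise_surname` is hypothetical):

```python
def normalise_surname(surname: str) -> str:
    """Turn an all-caps surname into title case; leave other forms untouched."""
    # str.title() also capitalises the part after a hyphen, e.g. in double surnames.
    return surname.title() if surname.isupper() else surname
```

Surnames with particles or apostrophes may need manual review, since `str.title()` capitalises every letter that follows a non-letter.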

sort component files

[x] sort component files

The component files should be ordered according to the contents' date.
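One possible way to get that order is sketched below. It assumes that the date in the file name (`ParlaMint-RO_YYYY-MM-DD-id<N>`) is the date of the contents and uses the numeric id as a tie-breaker; both the pattern and the tie-breaker are assumptions.

```python
import re
from datetime import date

# File name pattern assumed from the examples in this issue.
NAME_RE = re.compile(r"ParlaMint-RO_(\d{4})-(\d{2})-(\d{2})-id(\d+)(?:\.ana)?\.xml$")

def sort_key(filename):
    """Return (date, numeric id) parsed from a component file name."""
    match = NAME_RE.search(filename)
    if match is None:
        raise ValueError(f"unexpected component file name: {filename}")
    year, month, day, ident = map(int, match.groups())
    return (date(year, month, day), ident)

def sort_components(filenames):
    """Order component files by date, then by numeric id."""
    return sorted(filenames, key=sort_key)
```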

taxonomies

[x] translations
[x] wrong language context - English content in xml:lang="ro"
[x] missing descriptions

RePierre commented 1 year ago

Changed the capitalization of surnames with commit 51787f7.

RePierre commented 1 year ago

Sorted component files in commit be08d9a.

RePierre commented 1 year ago

Changed note type to narrative with commit 9fe5f43.

RePierre commented 1 year ago

Converted notes into more specific elements within segments with commit cc386af.

matyaskopp commented 1 year ago

Spaces around notes

[x] spaces around notes inside text

> Converted notes into more specific elements within segments with commit cc386af.

You have removed the spaces around notes, which can cause trouble in tokenization: a note can end up inside a token (unexpected behaviour in my annotation script). https://github.com/romanian-parlamint/ParlaMint/blob/cc386afc90e1298cb4f4d79f44d5558949e4eeae/Data/ParlaMint-RO/ParlaMint-RO_2000-04-14-id4927.xml#L472

<seg xml:id="ParlaMint-RO_2000-04-14-id4927.u39.seg6">Cine este pentru?<vocal type="shouting"><desc>(Vociferără în partea dreaptă a sălii).</desc></vocal>Vă <!-- ... --> confuzie.</seg>

Should be:

<seg xml:id="ParlaMint-RO_2000-04-14-id4927.u39.seg6">Cine este pentru? <vocal type="shouting">
  <desc>(Vociferără în partea dreaptă a sălii).</desc>
</vocal> Vă <!-- ... --> confuzie.</seg>
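A check along these lines might help catch such cases before annotation. It is a sketch only: the set of inline note elements is taken from the validator message quoted elsewhere in this issue, and `glued_notes` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# Inline elements allowed inside <seg> next to text (assumed set).
NOTE_TAGS = {"vocal", "kinesic", "incident", "gap", "note"}

def glued_notes(seg):
    """Return the inline notes in seg that touch neighbouring text without whitespace."""
    problems = []
    before = seg.text or ""  # text immediately before the current child
    for child in seg:
        after = child.tail or ""  # text immediately after the current child
        if child.tag.rsplit("}", 1)[-1] in NOTE_TAGS:
            glued_left = before != "" and not before[-1].isspace()
            glued_right = after != "" and not after[0].isspace()
            if glued_left or glued_right:
                problems.append(child)
        before = after
    return problems
```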

RePierre commented 1 year ago

Added spaces around notes with commit 79b08b1.

RePierre commented 1 year ago

wrong language context - English content in xml:lang="ro"

Can you please provide an example?

I ran `find -type f -name *.xml -exec grep --color=auto -i -nH --null -e lang\=\"ro\" \{\} +`, went over all results, and wasn't able to find English content. Maybe I'm missing something?

matyaskopp commented 1 year ago

> Can you please provide an example?
>
> I ran `find -type f -name *.xml -exec grep --color=auto -i -nH --null -e lang\=\"ro\" \{\} +`, went over all results, and wasn't able to find English content. Maybe I'm missing something?

Oh, sorry: your <teiCorpus> is in an English context:

<teiCorpus xmlns="http://www.tei-c.org/ns/1.0" xml:lang="en" xml:id="ParlaMint-RO">

This is the only corpus that has it. I implicitly expected it to have xml:lang="ro".

To check the language context of <term>, I used:

java -cp /usr/share/java/saxon.jar net.sf.saxon.Query -xi:off \!method=adaptive -qs:'//*[name()="term" and ./ancestor::*[@xml:lang][1]/@xml:lang="ro"]' -s:ParlaMint-RO/ParlaMint-RO.xml
<term xmlns="http://www.tei-c.org/ns/1.0">Legislatură</term>
<term xmlns="http://www.tei-c.org/ns/1.0">Unități geo-politice sau administrative</term>
<term xmlns="http://www.tei-c.org/ns/1.0">Legislatură națională</term>
<term xmlns="http://www.tei-c.org/ns/1.0">Organizație politică</term>
<term xmlns="http://www.tei-c.org/ns/1.0">Camere</term>
<term xmlns="http://www.tei-c.org/ns/1.0">Parlament bicameral</term>
<term xmlns="http://www.tei-c.org/ns/1.0">Senat</term>
<term xmlns="http://www.tei-c.org/ns/1.0">Camera deputaților</term>

The majority language in teiCorpus is usually English, so what you have is correct according to the documentation:

@xml:lang is also a global attribute and gives the language code of the text content of the element; for the corpus root this does not (just) mean the content of its TEI header, but primarily the textual content of its XIncluded components. The convention is that language of the text content of an element is determined by the value of the first @xml:lang attribute on its ancestor axis. In cases where the content is multilingual, the language code should be of the majority language. When the proportion of the languages is about equal, then the mul code for multiple languages can also be used.

but the common practice is to use the corpus language there.
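For reference, the same lookup can be sketched without Saxon. The snippet resolves the language of each `<term>` from the nearest `xml:lang` on the element itself or on its ancestors; note that the query above only inspects ancestors, so this is a slight variation. The function name `terms_in_language` is hypothetical.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
TEI = "{http://www.tei-c.org/ns/1.0}"

def terms_in_language(root, lang):
    """Return the text of every TEI <term> whose effective xml:lang equals lang."""
    # ElementTree has no parent pointers, so build a child -> parent map first.
    parent = {child: node for node in root.iter() for child in node}

    def effective_lang(el):
        while el is not None:
            if XML_LANG in el.attrib:
                return el.attrib[XML_LANG]
            el = parent.get(el)
        return None

    return [t.text for t in root.iter(TEI + "term") if effective_lang(t) == lang]
```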

@TomazErjavec Can English be preserved in teiCorpus here?

RePierre commented 1 year ago

Normalized setting element in corpus root file and component files and set corpus span with commit d343920.

Should resolve:

setting element in root file
corpus timespan setting

TomazErjavec commented 1 year ago

> @TomazErjavec Can English be preserved in teiCorpus here?

In practice I'd much rather not have an exception. So, teiCorpus and TEI should have @xml:lang="ro". But maybe teiHeader with @xml:lang="en" is legit?

RePierre commented 1 year ago

Changed language of the teiCorpus element in commit 548e357.

matyaskopp commented 1 year ago

Duplicate person

[x] duplicate person

Every person should have one record in listPerson: https://github.com/romanian-parlamint/ParlaMint/blob/548e3576054c9067aee43fb2275b879cac9ba806/Data/ParlaMint-RO/ParlaMint-RO.xml#L1306-L1324

          <person xml:id="Augustin-Lucian-Bolcas">
            <persName>
              <forename>Lucian</forename>
              <forename>Augustin</forename>
              <surname>Bolcaș</surname>
            </persName>
            <sex value="M"/>
            <affiliation ana="#RoParl.51" ref="#RoParl" role="member" from="2000-12-15" to="2004-11-30"/>
          </person>
          <person xml:id="Lucian-Augustin-Bolcas">
            <persName>
              <forename>Lucian</forename>
              <forename>Augustin</forename>
              <surname>Bolcaș</surname>
            </persName>
            <sex value="M"/>
            <affiliation ana="#RoParl.51" ref="#RoParl" role="member" from="2000-12-15" to="2004-11-30"/>
            <affiliation ana="#RoParl.52" ref="#RoParl" role="member" from="2004-12-19" to="2008-12-13"/>
          </person>
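Candidates for such duplicates could be found with something like the sketch below. It assumes that two records are the same person when they share the same set of forenames and surnames regardless of order, which will also group genuine homonyms and placeholder names, so the output needs a manual check. The name `duplicate_persons` is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

TEI = "{http://www.tei-c.org/ns/1.0}"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

def name_key(person):
    """Order-independent key built from all forenames and surnames of a person."""
    forenames = sorted(e.text.strip() for e in person.iter(TEI + "forename") if e.text)
    surnames = sorted(e.text.strip() for e in person.iter(TEI + "surname") if e.text)
    return (tuple(forenames), tuple(surnames))

def duplicate_persons(root):
    """Return lists of xml:id values that share the same name key."""
    groups = defaultdict(list)
    for person in root.iter(TEI + "person"):
        groups[name_key(person)].append(person.get(XML_ID))
    return [ids for ids in groups.values() if len(ids) > 1]
```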

`Necunoscut Necunoscut` person's name

[ ] Necunoscut Necunoscut

First occurrence: https://github.com/romanian-parlamint/ParlaMint/blob/548e3576054c9067aee43fb2275b879cac9ba806/Data/ParlaMint-RO/ParlaMint-RO.xml#L6030

          <person xml:id="Dan-Dumitrescu">
            <persName>
              <forename>Necunoscut</forename>
              <surname>Necunoscut</surname>
            </persName>
            <sex value="U"/>
            <affiliation ana="#RoParl.55" ref="#RoParl" role="member" from="2016-12-21" to="2020-12-20"/>
          </person>

RePierre commented 1 year ago

Missing speech content

As suggested by @TomazErjavec, added <gap> elements to the utterances without segments in commit 0082dd3.

RePierre commented 1 year ago

Duplicate person

Fixed duplicate person with commit ac9a2bc.

RePierre commented 1 year ago

corpus timespan bibl

Included corpus timespan in <bibl> element with commit 70b7fc2.

RePierre commented 1 year ago

corpus timespan it would be nice to have it in text content of corpus title too

Included corpus span in corpus subtitle with commit df3879b.

RePierre commented 1 year ago

presence list is missing status

As discussed in the meeting on April 12, we cannot provide the presence list in time for this version because this requires changes in the crawlers of the session transcripts. I will try to include this data into a future version of the corpus.

RePierre commented 1 year ago

extend meeting elements (#parla.term, #parla.sitting)

Extended meeting elements with term and sitting information with commit 75affa9.

matyaskopp commented 1 year ago

[x] include annotated component files

Error: /home/runner/work/ParlaMint/ParlaMint/ParlaMint/Data/ParlaMint-RO/ParlaMint-RO_2015-09-29-id7560.xml:132:189: error: text not allowed here; expected element "gap", "incident", "kinesic", "note", "pb", "s" or "vocal"

@RePierre, you include unannotated (TEI) files in the annotated (TEI.ana) root file: https://github.com/romanian-parlamint/ParlaMint/blob/459b829a1e053df1e22502222324d246be1c9a47/Data/ParlaMint-RO/ParlaMint-RO.ana.xml#L3018-L3027, e.g.

<xsi:include xmlns:xsi="http://www.w3.org/2001/XInclude" href="ParlaMint-RO_2015-09-29-id7560.xml"/>

should be

<xsi:include xmlns:xsi="http://www.w3.org/2001/XInclude" href="ParlaMint-RO_2015-09-29-id7560.ana.xml"/>
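A small check for this could look like the sketch below. It assumes that component files follow the `ParlaMint-RO_YYYY-MM-DD-id<N>` naming used in this issue and ignores any other XIncluded files; `unannotated_components` is a hypothetical name.

```python
import re
import xml.etree.ElementTree as ET

XI_INCLUDE = "{http://www.w3.org/2001/XInclude}include"
# Component file name pattern assumed from the examples in this issue.
COMPONENT = re.compile(r"ParlaMint-RO_\d{4}-\d{2}-\d{2}-id\d+(\.ana)?\.xml$")

def unannotated_components(root):
    """Return hrefs of XIncluded component files that lack the .ana suffix."""
    bad = []
    for el in root.iter(XI_INCLUDE):
        href = el.get("href", "")
        match = COMPONENT.search(href)
        if match and match.group(1) is None:
            bad.append(href)
    return bad
```

Called on the parsed `ParlaMint-RO.ana.xml` root, an empty result would mean that every component include points to a `.ana.xml` file.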

RePierre commented 1 year ago

include annotated component files

Included proper component files in commit 90da93b.

matyaskopp commented 1 year ago

@RePierre, thanks for the progress.

I have spotted an issue in the TEI.ana version of the files:

wrongly placed notes in the TEI.ana version

[ ] notes are placed at the beginning of seg
[ ] unannotated text after the first note

Data/ParlaMint-RO/ParlaMint-RO_2015-09-29-id7560.ana.xml:6433:284: error: text not allowed here; expected the element end-tag or element "gap", "incident", "kinesic", "note", "pb", "s" or "vocal"

TEI: (https://github.com/romanian-parlamint/ParlaMint/blob/5f986e2cc79e3f28347c6a655416c7f4f4d57a1c/Data/ParlaMint-RO/ParlaMint-RO_2015-09-29-id7560.xml#L284)

<seg xml:id="ParlaMint-RO_2015-09-29-id7560.u11.seg8">Cred <!--
... 
--> salariile. <vocal type="noise"><desc>(Aplauze.)</desc></vocal> Însă<!--
...
--> toţi.</seg>

TEI.ana:

<seg xml:id="ParlaMint-RO_2015-09-29-id7560.u11.seg8"><vocal type="noise"><desc>(Aplauze.)</desc></vocal> Însă<!--
...
-->toţi.<s xml:id="ParlaMint-RO_2015-09-29-id7560.u11.seg8.1">
  <w xml:id="ParlaMint-RO_2015-09-29-id7560.u11.seg8.1.1" lemma="Cred" pos="Vmip1s" msd="UPosTag=AUX|Mood=Ind|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin">Cred</w>
<!--... -->
</s>
<!--... -->
</seg>

matyaskopp commented 1 year ago

Unrecognized full-paragraph note

[ ] "full-paragraph" notes

https://github.com/romanian-parlamint/ParlaMint/blob/a510c149ba04407fe6df77414b3a2aaec6f47022/Data/ParlaMint-RO/ParlaMint-RO_2006-09-18-id6154.xml#L422-L424

  <seg xml:id="ParlaMint-RO_2006-09-18-id6154.u33.seg8">Mulţumesc.</seg>
  <seg xml:id="ParlaMint-RO_2006-09-18-id6154.u33.seg9">(Domnul Valeriu Ştefan Zgonea se îndreaptă spre prezidiu.)</seg>
</u>

should be:

  <seg xml:id="ParlaMint-RO_2006-09-18-id6154.u33.seg8">Mulţumesc.</seg>
</u>
<note type="narrative">(Domnul Valeriu Ştefan Zgonea se îndreaptă spre prezidiu.)</note>

Other occurrences in sample data:

DataForks/ParlaMint-RO/ParlaMint-RO_2006-09-18-id6154.xml:411:          <seg xml:id="ParlaMint-RO_2006-09-18-id6154.u32.seg4">(Domnul Valeriu Ştefan Zgonea părăseşte prezidiul şi se îndreaptă spre tribună.)</seg>
DataForks/ParlaMint-RO/ParlaMint-RO_2006-09-18-id6154.xml:423:          <seg xml:id="ParlaMint-RO_2006-09-18-id6154.u33.seg9">(Domnul Valeriu Ştefan Zgonea se îndreaptă spre prezidiu.)</seg>
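Such segments could be detected with a simple pattern, sketched here under the assumption that a full-paragraph note is a segment consisting of exactly one parenthesised stretch (`is_full_paragraph_note` is a hypothetical helper):

```python
import re

# The whole segment is a single parenthesised stretch without nested parentheses.
FULL_NOTE = re.compile(r"^\s*\([^()]*\)\s*$")

def is_full_paragraph_note(seg_text: str) -> bool:
    """True when the segment text is nothing but one parenthesised note."""
    return bool(FULL_NOTE.match(seg_text))
```

When the note is the last segment of an utterance it can simply be moved after `</u>`; a note in the middle of an utterance would need the utterance to be split or the note to be encoded inline.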

matyaskopp commented 1 year ago

U+0096 (SPA) Unicode Character

[ ] remove <0x0096> character

This character is allowed in ParlaMint, but it causes problems in linguistic annotation, so I suggest removing it from the text: https://github.com/romanian-parlamint/ParlaMint/blob/a510c149ba04407fe6df77414b3a2aaec6f47022/Data/ParlaMint-RO/ParlaMint-RO_2000-10-24-id4980.xml#L148

<seg xml:id="ParlaMint-RO_2000-10-24-id4980.u2.seg5">După <!--
...
--> urgie  1940. Dar n-a fost să fie aşa.</seg>

<w xml:id="ParlaMint-RO_2000-10-24-id4980.u2.seg5.1.29" lemma="" pos="Ncm--n" msd="UPosTag=NOUN|Definite=Ind|Gender=Masc"></w>
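Removing the character before annotation could be as simple as the sketch below (`remove_chars` is a hypothetical helper; by default it removes only U+0096). Byte 0x96 is an en dash in Windows-1252, so the character may come from a mis-decoded dash in the source; if so, mapping it to U+2013 instead of deleting it could be considered, but that is an assumption about the source.

```python
def remove_chars(text: str, chars: str = "\u0096") -> str:
    """Delete every occurrence of the given characters from text."""
    return text.translate({ord(c): None for c in chars})
```

Deleting the character can leave two consecutive spaces behind, as in the segment above.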

matyaskopp commented 1 year ago

Named entities

[ ] named entities contain non-proper names

I guess you are using a model that labels not only named entities from the PER/LOC/ORG/MISC set but also DATE and probably other labels, something like this: https://huggingface.co/dumitrescustefan/bert-base-romanian-ner. You then map all non-proper names to the MISC category, e.g.

<name type="MISC">
  <w xml:id="ParlaMint-RO_2000-10-24-id4980.u2.seg5.1.23" lemma="acel" pos="Dd3msr---e" msd="UPosTag=DET|Case=Acc,Nom|Gender=Masc|Number=Sing|Person=3|Position=Prenom|PronType=Dem">acel</w>
  <w xml:id="ParlaMint-RO_2000-10-24-id4980.u2.seg5.1.24" lemma="an" pos="Ncms-n" msd="UPosTag=NOUN|Definite=Ind|Gender=Masc|Number=Sing">an</w>
</name>

or

<name type="MISC">
  <w xml:id="ParlaMint-RO_2000-10-24-id4980.u2.seg5.1.30" lemma="1940" pos="Mc-s-d" msd="UPosTag=">1940</w>
</name>

The year 1940 is not a proper name, so it shouldn't be surrounded by <name>; <date> would be better. There are two options to solve this:

1. remove named entities that are not proper names (DATETIME, PERIOD, MONEY, QUANTITY, ...)
2. find inspiration in the CZ corpus and use the proper tags. See mapping: https://github.com/ufal/ParCzech/issues/95#issuecomment-779237221

We are under time pressure, so I suggest using option (1) for ParlaMint3.0; you can possibly improve it in ParlaMint3.1 (create a RO-specific taxonomy, use the proper elements and add the ana attribute). @TomazErjavec ??
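Option (1) could be sketched as a filter applied to the model output before anything is mapped to MISC. The label set below contains only the labels named above and is certainly incomplete; the `(label, text)` tuple format and the name `drop_non_proper` are assumptions for illustration, not the actual output format of the NER model.

```python
# Labels that do not denote proper names (incomplete; taken from the list above).
NON_PROPER_LABELS = {"DATETIME", "PERIOD", "MONEY", "QUANTITY"}

def drop_non_proper(entities):
    """Keep only entities whose model label is not in NON_PROPER_LABELS."""
    return [(label, text) for label, text in entities if label not in NON_PROPER_LABELS]
```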

matyaskopp commented 1 year ago

shifted NEs ?

[ ] shifted NEs

In this paragraph (ParlaMint-RO_2000-10-24-id4980.u2.seg8.2), the NEs seem to be shifted. Source: https://raw.githubusercontent.com/clarin-eric/ParlaMint/3f2d0a820d31aa7e55b72156089a3450b303e3bc/Data/ParlaMint-RO/ParlaMint-RO_2000-10-24-id4980.ana.xml (reformatted, with the token elements w and pc removed):

<s xml:id="ParlaMint-RO_2000-10-24-id4980.u2.seg8.2">
atitudinea autorităţilor ucrainene faţă de delegaţiile judeţului Suceava şi
<name type="MISC">Botoşani</name>
, la festivitatea dezvelirii
<name type="LOC">statuii</name>
lui
<name type="LOC">Eminescu</name>
, la Cernăuţi, în ziua de 15 iunie
<name type="LOC">2000</name>
; constrângerile
<name type="MISC">aduse în şcolile româneşti;</name>
coborârea unicului steag românesc de
<name type="MISC">pe</name>
clădirea sediului
<name type="LOC">redacţiei ziarului"</name>
Lumea"
<name type="MISC">;</name>
prezenţa la
<name type="MISC">manifestările româneşti a unor</name>
reprezentanţi gălăgioşi ai organizaţiilor
<name type="MISC">extremiste</name>
ucrainene; oprirea tinerilor etnici români,
<name type="MISC">în</name>
număr de
<name type="PER">200, de</name>
a veni la studii
<name type="MISC">în</name>
România, cu burse din partea statului
<name type="LOC">român</name>
şi altele.
</s>

matyaskopp commented 1 year ago

Voci din sală: in utterance

[ ] voice from the hall

https://github.com/romanian-parlamint/ParlaMint/blob/a510c149ba04407fe6df77414b3a2aaec6f47022/Data/ParlaMint-RO/ParlaMint-RO_2000-10-24-id4980.xml#L408-L414

<note type="speaker">Domnul Vasile Lupu:</note>
<u ana="#chair" who="#Vasile-Lupu" xml:id="ParlaMint-RO_2000-10-24-id4980.u37">
  <seg xml:id="ParlaMint-RO_2000-10-24-id4980.u37.seg1">Să vedem cine îl face. <vocal type="murmuring"><desc>(Rumoare în partea stângă a sălii)</desc></vocal> </seg>
  <seg xml:id="ParlaMint-RO_2000-10-24-id4980.u37.seg2">Dar, iată, se pare că nu s-a terminat şedinţa Biroului permanent.</seg>
  <seg xml:id="ParlaMint-RO_2000-10-24-id4980.u37.seg3">Voci din sală:</seg>
  <seg xml:id="ParlaMint-RO_2000-10-24-id4980.u37.seg4">S-a terminat de mult!</seg>
</u>

should be:

<note type="speaker">Domnul Vasile Lupu:</note>
<u ana="#chair" who="#Vasile-Lupu" xml:id="ParlaMint-RO_2000-10-24-id4980.u37">
  <seg xml:id="ParlaMint-RO_2000-10-24-id4980.u37.seg1">Să vedem cine îl face. <vocal type="murmuring"><desc>(Rumoare în partea stângă a sălii)</desc></vocal> </seg>
  <seg xml:id="ParlaMint-RO_2000-10-24-id4980.u37.seg2">Dar, iată, se pare că nu s-a terminat şedinţa Biroului permanent.</seg>
</u>
<note type="speaker">Voci din sală:</note>
<!-- no who attribute, ana is regular - expecting MP interrupting -->
<u ana="#regular" xml:id="ParlaMint-RO_2000-10-24-id4980.u38">
  <seg xml:id="ParlaMint-RO_2000-10-24-id4980.u38.seg1">S-a terminat de mult!</seg>
</u>

matyaskopp commented 1 year ago

person - affiliation - organization

[ ] parliamentary groups
[ ] only one virtual parliamentary group <orgName xml:lang="en" full="yes">Placeholder parliamentary group</orgName>
[ ] government

I guess you are aware of this; I just wanted it to be recorded.

  INFO[10]  Total number of affiliations with RoParl: 256
  INFO[10]  Total number of affiliations with RoGov: 0
  Error: ERROR[10]  government-role organisation without affiliation: #RoGov
  INFO[10]  Total number of affiliations with RoParl.All: 0
  WARN[10]  parliamentaryGroup-role organisation without affiliation: #RoParl.All
  INFO[12]  Total number of organizations with parliament role: 1
  INFO[12]  Total number of organizations with government role: 1
  INFO[12]  Total number of organizations with parliamentaryGroup role: 1
  INFO[??]  Total number of affiliations 256
  INFO[??]  Total number of NO-role affiliations 0
  INFO[??]  Total number of 'member' role affiliations 256

RePierre commented 1 year ago

wrongly placed notes in the TEI.ana version

Fixed with commit 6662ec4.

RePierre commented 1 year ago

remove <0x0096> character

Removed in commit 69a116e.

matyaskopp commented 1 year ago

strange UPosTag `_` when `Mc-s-d`

[ ] UPosTag of digit tokens Mc-s-d

Every token with pos="Mc-s-d" has the wrong msd="UPosTag=_". Sample:

<w xml:id="ParlaMint-RO_2000-10-24-id4980.u2.seg5.1.2" 
   lemma="1990"
   pos="Mc-s-d"
   msd="UPosTag=_">1990</w>

You can fix this with msd="UPosTag=NUM" or msd="UPosTag=NUM|NumForm=Digit":

<w xml:id="ParlaMint-RO_2000-10-24-id4980.u2.seg5.1.2" 
   lemma="1990"
   pos="Mc-s-d"
   msd="UPosTag=NUM|NumForm=Digit">1990</w>

strange UPosTag `_` when `Mc-s-b`

[ ] UPosTag of digit tokens Mc-s-b

Here I suggest replacing _ with X

cat DataForks/ParlaMint-RO/ParlaMint-RO_*.ana.xml| grep 'UPosTag=_"' | grep -v 'pos="Mc.s.d"'

<w xml:id="ParlaMint-RO_2006-09-18-id6154.u31.seg3.1.73" lemma="29,4" pos="Mc-s-b" msd="UPosTag=_">29,4</w>
<w xml:id="ParlaMint-RO_2006-09-18-id6154.u31.seg7.1.14" lemma="29,4" pos="Mc-s-b" msd="UPosTag=_">29,4</w>
<w xml:id="ParlaMint-RO_2006-09-18-id6154.u76.seg2.1.1" lemma="Mie" pos="Mc-s-b" msd="UPosTag=_">Mie</w>
<w xml:id="ParlaMint-RO_2006-09-18-id6154.u136.seg18.1.2" lemma="31.III.2006" pos="Mc-s-b" msd="UPosTag=_">31.III.2006</w>
<w xml:id="ParlaMint-RO_2006-09-18-id6154.u153.seg5.1.52" lemma="Secuiesc" pos="Mc-s-b" msd="UPosTag=_">Secuiesc</w>
<w xml:id="ParlaMint-RO_2015-09-29-id7560.u60.seg7.1.18" lemma="207;voturi" pos="Mc-s-b" msd="UPosTag=_">207;voturi</w>
<w xml:id="ParlaMint-RO_2015-10-12-id7569.u48.seg9.1.12" lemma="2003/88" pos="Mc-s-b" msd="UPosTag=_">2003/88</w>
<w xml:id="ParlaMint-RO_2015-10-12-id7569.u96.seg2.2.15" lemma="2002/772" pos="Mc-s-b" msd="UPosTag=_">2002/772</w>
<w xml:id="ParlaMint-RO_2015-10-12-id7569.u156.seg16.1.25" lemma="2007-2013" pos="Mc-s-b" msd="UPosTag=_">2007-2013</w>
<w xml:id="ParlaMint-RO_2018-03-05-id7900.u7.seg11.1.1" lemma="Mie" pos="Mc-s-b" msd="UPosTag=_">Mie</w>
<w xml:id="ParlaMint-RO_2018-03-05-id7900.u45.seg8.1.1" lemma="Mie" pos="Mc-s-b" msd="UPosTag=_">Mie</w>
<w xml:id="ParlaMint-RO_2021-10-25-id8335.u70.seg2.1.34" lemma="30.06.2021" pos="Mc-s-b" msd="UPosTag=_">30.06.2021</w>
<w xml:id="ParlaMint-RO_2021-10-25-id8335.u91.seg2.1.36" lemma="29A" pos="Mc-s-b" msd="UPosTag=_">29A</w>
<w xml:id="ParlaMint-RO_2021-10-25-id8335.u92.seg2.1.40" lemma="29A" pos="Mc-s-b" msd="UPosTag=_">29A</w>
<w xml:id="ParlaMint-RO_2021-10-25-id8335.u92.seg3.1.7" lemma="29A" pos="Mc-s-b" msd="UPosTag=_">29A</w>
<w xml:id="ParlaMint-RO_2021-10-25-id8335.u92.seg6.1.6" lemma="29A" pos="Mc-s-b" msd="UPosTag=_">29A</w>
<w xml:id="ParlaMint-RO_2021-10-25-id8335.u92.seg6.1.47" lemma="29A" pos="Mc-s-b" msd="UPosTag=_">29A</w>
<w xml:id="ParlaMint-RO_2021-10-25-id8335.u92.seg12.1.7" lemma="29A" pos="Mc-s-b" msd="UPosTag=_">29A</w>
<w xml:id="ParlaMint-RO_2021-10-25-id8335.u118.seg6.1.41" lemma="27.548" pos="Mc-s-b" msd="UPosTag=_">27.548</w>
<w xml:id="ParlaMint-RO_2021-10-25-id8335.u126.seg4.1.30" lemma="1.579/2006" pos="Mc-s-b" msd="UPosTag=_">1.579/2006</w>
<w xml:id="ParlaMint-RO_2021-11-09-id8341.u96.seg3.2.49" lemma="1,5°C" pos="Mc-s-b" msd="UPosTag=_">1,5°C</w>
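Both suggestions can be combined in one small post-processing step, sketched below (`fix_msd` is a hypothetical helper; it only touches tokens whose msd is exactly `UPosTag=_`):

```python
def fix_msd(pos: str, msd: str) -> str:
    """Replace the placeholder UPosTag=_ according to the pos value."""
    if msd != "UPosTag=_":
        return msd  # already has a real UPosTag
    if pos == "Mc-s-d":
        return "UPosTag=NUM|NumForm=Digit"
    if pos == "Mc-s-b":
        return "UPosTag=X"
    return msd  # other pos values are left for manual inspection
```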

matyaskopp commented 1 year ago

No `join` attribute

[ ] join="right" is missing in TEI.ana

see documentation: https://clarin-eric.github.io/ParlaMint/#sec-ana-words
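As I read the linked section, join="right" marks a token that is not followed by whitespace in the original text; please check the documentation for the exact rules. Under that assumption, the flags could be derived from the raw segment text and the token list roughly like this (`join_flags` is a hypothetical helper and expects the tokens in text order):

```python
def join_flags(text, tokens):
    """For each token, return True when it is directly followed by a non-space character."""
    flags = []
    pos = 0
    for tok in tokens:
        start = text.index(tok, pos)  # raises ValueError if the token is not found
        pos = start + len(tok)
        flags.append(pos < len(text) and not text[pos].isspace())
    return flags
```

Tokens with a `True` flag would then get join="right" in the TEI.ana output.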

TomazErjavec commented 1 year ago

As RO won't be a part of 3.1, moving this to "future" milestone.

clarin-eric / ParlaMint

RO Feedback #626

