clarin-eric / ParlaMint

ParlaMint: Comparable Parliamentary Corpora
https://clarin-eric.github.io/ParlaMint/

ES-CT: linguistics annotations #639

Open matyaskopp opened 1 year ago

matyaskopp commented 1 year ago

@rjzevallos, I compared the TEI and TEI.ana versions of one file. I don't know how complicated these issues are to fix, but I would be glad if they were fixed or at least documented, because these bugs reduce the usability of the corpus.

Syntactic words and join collision

TEI version:

<seg xml:id="ParlaMint-ES-CT_2015-11-10-0302.4.0.8" xml:lang="ca"><!-- 
... --> Hi havia, com vostè ha recordat, el segon tripartit. <!-- ... --></seg>

TEI.ana version:

<seg xml:id="ParlaMint-ES-CT_2015-11-10-0302.4.0.8" xml:lang="ca">
<!-- ... -->
    <s xml:id="ParlaMint-ES-CT_2015-11-10-0302.4.0.8.3" xml:lang="ca">
        <w xml:id="ParlaMint-ES-CT_2015-11-10-0302.4.0.8.3.1" msd="UPosTag=PRON|PronType=Prs|Person=3" lemma="hi">Hi</w>
        <w xml:id="ParlaMint-ES-CT_2015-11-10-0302.4.0.8.3.2" msd="UPosTag=VERB|Mood=Ind|Tense=Imp|Person=3|Number=Sing" lemma="heure" join="right">havia</w>
        <pc xml:id="ParlaMint-ES-CT_2015-11-10-0302.4.0.8.3.3" msd="UPosTag=PUNCT|PunctType=Comm">,</pc>
        <w xml:id="ParlaMint-ES-CT_2015-11-10-0302.4.0.8.3.4" msd="UPosTag=SCONJ" lemma="com">com</w>
        <w xml:id="ParlaMint-ES-CT_2015-11-10-0302.4.0.8.3.5" msd="UPosTag=PRON|PronType=Prs|Person=2|Number=Sing|Polite=Form" lemma="vostè">vostè</w>
        <w xml:id="ParlaMint-ES-CT_2015-11-10-0302.4.0.8.3.6" msd="UPosTag=AUX|Mood=Ind|Tense=Pres|Person=3|Number=Sing" lemma="haver">ha</w>
        <w xml:id="ParlaMint-ES-CT_2015-11-10-0302.4.0.8.3.7" msd="UPosTag=VERB|Mood=Par|Number=Sing|Gender=Masc" lemma="recordar" join="right">recordat</w>
        <w xml:id="ParlaMint-ES-CT_2015-11-10-0302.4.0.8.3.10">
           recordatel
           <w xml:id="ParlaMint-ES-CT_2015-11-10-0302.4.0.8.3.8" msd="UPosTag=PUNCT|PunctType=Comm" norm="," lemma=","/>
           <w xml:id="ParlaMint-ES-CT_2015-11-10-0302.4.0.8.3.9" msd="UPosTag=DET|PronType=Art|Gender=Masc|Number=Sing" norm="el" lemma="el"/>
        </w>
        <w xml:id="ParlaMint-ES-CT_2015-11-10-0302.4.0.8.3.11" msd="UPosTag=NOUN|Gender=Masc|Number=Sing" lemma="segon">segon</w>
        <w xml:id="ParlaMint-ES-CT_2015-11-10-0302.4.0.8.3.12" msd="UPosTag=VERB|Mood=Par|Number=Sing|Gender=Masc" lemma="tripartir" join="right">tripartit</w>
        <pc xml:id="ParlaMint-ES-CT_2015-11-10-0302.4.0.8.3.13" msd="UPosTag=PUNCT|PunctType=Peri">.</pc>
        <linkGrp type="UD-SYN" targFunc="head argument">
          <!-- ... -->
        </linkGrp>
    </s>
<!-- ... -->
</seg>

TEI version contains the sentence:

Hi havia, com vostè ha recordat, el segon tripartit.

but TEI.ana version contains a different sentence:

Hi havia, com vostè ha recordatrecordatel segon tripartit.
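The comparison above can be reproduced mechanically. The following is a minimal sketch, assuming that the surface text is rebuilt from the top-level `<w>`/`<pc>` tokens, that `join="right"` suppresses the following space, and that an outer `<w>` wrapping syntactic words carries the surface form as its own text. The function names and the shortened sample are illustrative only; they are not part of the ParlaMint tooling.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Strip an optional XML namespace from a tag name."""
    return tag.rsplit("}", 1)[-1]


def surface_tokens(elem):
    """Yield (text, join) for each orthographic token below elem."""
    for child in elem:
        if local(child.tag) in ("w", "pc"):
            # For an outer <w> that wraps syntactic words, only its own text
            # is used; the nested <w norm="..."/> children are not visited.
            yield (child.text or "").strip(), child.get("join")
        else:
            # Wrappers such as <name> are descended into to reach the tokens.
            yield from surface_tokens(child)


def surface_text(sentence):
    """Concatenate tokens, adding a space unless join="right" is set."""
    out = []
    for text, join in surface_tokens(sentence):
        out.append(text)
        if join != "right":
            out.append(" ")
    return "".join(out).strip()


# Shortened, hypothetical excerpt mirroring the structure shown above.
SAMPLE = """<s>
  <w>ha</w>
  <w join="right">recordat</w>
  <w>
    recordatel
    <w norm="," lemma=","/>
    <w norm="el" lemma="el"/>
  </w>
  <w>segon</w>
  <w join="right">tripartit</w>
  <pc>.</pc>
</s>"""

print(surface_text(ET.fromstring(SAMPLE)))
# prints: ha recordatrecordatel segon tripartit.
```

Comparing this output with the text of the corresponding `<seg>` in the plain TEI file would flag sentences like the one above.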

Missing UD features in named entities, wrong UPosTag

<name type="MISC">
   <w xml:id="ParlaMint-ES-CT_2015-11-10-0302.1.0.3.1.28" msd="UPosTag=PROPN" lemma="reglament">Reglament</w>
   <w xml:id="ParlaMint-ES-CT_2015-11-10-0302.1.0.3.1.29" msd="UPosTag=PROPN" lemma="de">de</w>
   <w xml:id="ParlaMint-ES-CT_2015-11-10-0302.1.0.3.1.30" msd="UPosTag=PROPN" lemma="el">el</w>
   <w xml:id="ParlaMint-ES-CT_2015-11-10-0302.1.0.3.1.31" msd="UPosTag=PROPN" lemma="parlament">Parlament</w>
   <w xml:id="ParlaMint-ES-CT_2015-11-10-0302.1.0.3.1.32" msd="UPosTag=PROPN" lemma="de">de</w>
   <w xml:id="ParlaMint-ES-CT_2015-11-10-0302.1.0.3.1.33" msd="UPosTag=PROPN" lemma="catalunya">Catalunya</w>
</name>
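A rough way to list such tokens is sketched below. It rests on a heuristic that is my assumption, not a rule from the corpus guidelines: a token inside `<name>` whose `msd` is exactly `UPosTag=PROPN` and whose form is entirely lowercase is probably a function word that lost its real tag and features. The function name and sample are hypothetical.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Strip an optional XML namespace from a tag name."""
    return tag.rsplit("}", 1)[-1]


def suspicious_name_tokens(root):
    """Return forms inside <name> tagged as bare PROPN although lowercase."""
    hits = []
    for name in root.iter():
        if local(name.tag) != "name":
            continue
        # Note: tokens in nested <name> elements would be reported twice.
        for w in name.iter():
            if local(w.tag) != "w":
                continue
            form = (w.text or "").strip()
            if w.get("msd") == "UPosTag=PROPN" and form.islower():
                hits.append(form)
    return hits


# Hypothetical sample with the same tagging pattern as the excerpt above.
NAME_SAMPLE = """<s><name type="MISC">
  <w msd="UPosTag=PROPN" lemma="reglament">Reglament</w>
  <w msd="UPosTag=PROPN" lemma="de">de</w>
  <w msd="UPosTag=PROPN" lemma="el">el</w>
  <w msd="UPosTag=PROPN" lemma="parlament">Parlament</w>
  <w msd="UPosTag=PROPN" lemma="de">de</w>
  <w msd="UPosTag=PROPN" lemma="catalunya">Catalunya</w>
</name></s>"""

print(suspicious_name_tokens(ET.fromstring(NAME_SAMPLE)))
# prints: ['de', 'el', 'de']
```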

No syntactic words in named entities + missing join right

TEI version:

               <seg xml:id="ParlaMint-ES-CT_2015-11-10-0302.1.0.3" xml:lang="ca"><!-- ... --> Reglament del Parlament de Catalunya.</seg>

TEI.ana version:

<seg xml:id="ParlaMint-ES-CT_2015-11-10-0302.1.0.3" xml:lang="ca">
    <s xml:id="ParlaMint-ES-CT_2015-11-10-0302.1.0.3.1" xml:lang="ca">
        <name type="MISC">
           <w xml:id="ParlaMint-ES-CT_2015-11-10-0302.1.0.3.1.28" msd="UPosTag=PROPN" lemma="reglament">Reglament</w>
           <w xml:id="ParlaMint-ES-CT_2015-11-10-0302.1.0.3.1.29" msd="UPosTag=PROPN" lemma="de">de</w>
           <w xml:id="ParlaMint-ES-CT_2015-11-10-0302.1.0.3.1.30" msd="UPosTag=PROPN" lemma="el">el</w>
           <w xml:id="ParlaMint-ES-CT_2015-11-10-0302.1.0.3.1.31" msd="UPosTag=PROPN" lemma="parlament">Parlament</w>
           <w xml:id="ParlaMint-ES-CT_2015-11-10-0302.1.0.3.1.32" msd="UPosTag=PROPN" lemma="de">de</w>
           <w xml:id="ParlaMint-ES-CT_2015-11-10-0302.1.0.3.1.33" msd="UPosTag=PROPN" lemma="catalunya">Catalunya</w>
        </name>
        <pc xml:id="ParlaMint-ES-CT_2015-11-10-0302.1.0.3.1.34" msd="UPosTag=PUNCT|PunctType=Peri">.</pc>
    </s>
</seg>

TEI version contains:

Reglament del Parlament de Catalunya.

but TEI.ana version contains:

Reglament de el Parlament de Catalunya .

Missing join right in articles

TEI:

<seg xml:id="ParlaMint-ES-CT_2015-11-10-0302.1.0.1" xml:lang="ca">D’acord amb l’article 146

TEI.ana

<seg xml:id="ParlaMint-ES-CT_2015-11-10-0302.1.0.1" xml:lang="ca">
    <s xml:id="ParlaMint-ES-CT_2015-11-10-0302.1.0.1.1" xml:lang="ca">
        <w xml:id="ParlaMint-ES-CT_2015-11-10-0302.1.0.1.1.1" msd="UPosTag=ADP|AdpType=Prep" lemma="de">D'</w>
        <w xml:id="ParlaMint-ES-CT_2015-11-10-0302.1.0.1.1.2" msd="UPosTag=NOUN|Gender=Masc|Number=Sing" lemma="acord">acord</w>
        <w xml:id="ParlaMint-ES-CT_2015-11-10-0302.1.0.1.1.3" msd="UPosTag=ADP|AdpType=Prep" lemma="amb">amb</w>
        <w xml:id="ParlaMint-ES-CT_2015-11-10-0302.1.0.1.1.4" msd="UPosTag=DET|PronType=Art|Number=Sing" lemma="el">l'</w>
        <w xml:id="ParlaMint-ES-CT_2015-11-10-0302.1.0.1.1.5" msd="UPosTag=NOUN|Gender=Masc|Number=Sing" lemma="article">article</w>
        <w xml:id="ParlaMint-ES-CT_2015-11-10-0302.1.0.1.1.6" msd="UPosTag=NUM" lemma="146">146</w>

TEI version contains:

D’acord amb l’article 146

but TEI.ana version contains:

D’ acord amb l’ article 146
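Tokens of this kind could be found with a check like the following sketch. It assumes that every `<w>` ending in an apostrophe (straight or typographic) should carry `join="right"`; the function name and the `t1`...`t5` identifiers in the sample are made up for illustration.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
APOSTROPHES = ("'", "\u2019")


def missing_join_after_apostrophe(root):
    """Return the xml:id of each <w> ending in an apostrophe without join="right"."""
    hits = []
    for el in root.iter():
        if el.tag.rsplit("}", 1)[-1] != "w":
            continue
        text = (el.text or "").strip()
        if text.endswith(APOSTROPHES) and el.get("join") != "right":
            hits.append(el.get(XML_ID))
    return hits


# Hypothetical sample: t1 lacks the attribute, t4 shows the expected encoding.
ELISION_SAMPLE = """<s>
  <w xml:id="t1" lemma="de">D'</w>
  <w xml:id="t2" lemma="acord">acord</w>
  <w xml:id="t3" lemma="amb">amb</w>
  <w xml:id="t4" lemma="el" join="right">l'</w>
  <w xml:id="t5" lemma="article">article</w>
</s>"""

print(missing_join_after_apostrophe(ET.fromstring(ELISION_SAMPLE)))
# prints: ['t1']
```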

Syntactic words at the beginning of sentence ???

TEI:

<seg xml:id="ParlaMint-ES-CT_2015-11-10-0302.4.0.25" xml:lang="ca"><!-- 
... --> Dels tribunals ordinaris de la justícia catalana. <!-- ... --></seg>

TEI.ana:

<seg xml:id="ParlaMint-ES-CT_2015-11-10-0302.4.0.25" xml:lang="ca">
<!-- ... --> 
    <s xml:id="ParlaMint-ES-CT_2015-11-10-0302.4.0.25.5" xml:lang="ca">
       <w xml:id="ParlaMint-ES-CT_2015-11-10-0302.4.0.25.5.1" msd="UPosTag=ADP|AdpType=Prep" lemma="de">De</w>
       <w xml:id="ParlaMint-ES-CT_2015-11-10-0302.4.0.25.5.2" msd="UPosTag=DET|PronType=Art|Gender=Masc|Number=Plur" lemma="el">els</w>
       <w xml:id="ParlaMint-ES-CT_2015-11-10-0302.4.0.25.5.3" msd="UPosTag=NOUN|Gender=Masc|Number=Plur" lemma="tribunal">tribunals</w>
       <w xml:id="ParlaMint-ES-CT_2015-11-10-0302.4.0.25.5.4" msd="UPosTag=ADJ|Gender=Masc|Number=Plur" lemma="ordinari">ordinaris</w>
       <w xml:id="ParlaMint-ES-CT_2015-11-10-0302.4.0.25.5.5" msd="UPosTag=ADP|AdpType=Prep" lemma="de">de</w>
       <w xml:id="ParlaMint-ES-CT_2015-11-10-0302.4.0.25.5.6" msd="UPosTag=DET|PronType=Art|Gender=Fem|Number=Sing" lemma="el">la</w>
       <w xml:id="ParlaMint-ES-CT_2015-11-10-0302.4.0.25.5.7" msd="UPosTag=NOUN|Number=Sing" lemma="justícia">justícia</w>
       <w xml:id="ParlaMint-ES-CT_2015-11-10-0302.4.0.25.5.8" msd="UPosTag=ADJ|Gender=Fem|Number=Sing" lemma="català" join="right">catalana</w>
       <pc xml:id="ParlaMint-ES-CT_2015-11-10-0302.4.0.25.5.9" msd="UPosTag=PUNCT|PunctType=Peri">.</pc>
<!-- ... -->
    </s>
<!-- ... -->
</seg>

TEI version contains:

Dels tribunals ordinaris de la justícia catalana.

but TEI.ana version contains:

De els tribunals ordinaris de la justícia catalana.

Misplaced join right

TEI:

... fixin-s’hi en ...

TEI.ana

    <w xml:id="ParlaMint-ES-CT_2015-11-10-0302.4.0.37.2.67">
    fixin-s
    <w xml:id="ParlaMint-ES-CT_2015-11-10-0302.4.0.37.2.65" msd="UPosTag=VERB|Mood=Sub|Tense=Pres|Person=3|Number=Plur" norm="fixin" lemma="fixar"/>
    <w xml:id="ParlaMint-ES-CT_2015-11-10-0302.4.0.37.2.66" msd="UPosTag=PRON" norm="-s" lemma="es"/>
    </w>
    <w xml:id="ParlaMint-ES-CT_2015-11-10-0302.4.0.37.2.68" msd="UPosTag=PRON|PronType=Prs|Person=3" lemma="hi" join="right">'hi</w>
    <w xml:id="ParlaMint-ES-CT_2015-11-10-0302.4.0.37.2.69" msd="UPosTag=ADP|AdpType=Prep" lemma="en">en</w>

TEI: fixin-s’hi en vs TEI.ana: fixin-s ’hien

nuriabel commented 1 year ago

Dear Matyás, Thanks a lot for your careful analysis. Yes, we are going to correct most of the errors you found, except for the Named Entities analysis. We will keep you updated. Best regards N.

On Wed, 26 Apr 2023 at 10:20, Matyáš Kopp wrote:

Assigned #639 https://github.com/clarin-eric/ParlaMint/issues/639 to @nuriabel https://github.com/nuriabel.


maartenpt commented 9 months ago

Contractions seem not to have been handled properly in general, not only at the beginning of a sentence. Take the word "del", which is never split into de+el and has a wide range of UPOS tags:

PARLAMINT-31-PARLAMINT-ES-CT> Matches = [form="del"];
PARLAMINT-31-PARLAMINT-ES-CT> group Matches match upos;
#---------------------------------------------------------------------
(all)                         NOUN                               95942
                              ADJ                                12833
                              PUNCT                              10153
                              ADP                                 9990
                              VERB                                8936
                              ADV                                 8071
                              PROPN                               8013
                              NUM                                 4084
                              CCONJ                               3954
                              DET                                 2099
                              AUX                                 1382
                              PRON                                 915
                              SCONJ                                308
                              INTJ                                   1
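
The same kind of count can be cross-checked directly on a TEI.ana file without a CQP index. This is a minimal sketch, assuming the UPOS tag is stored in the `msd` attribute as `UPosTag=...` (as in the excerpts in this issue) and matching the form case-sensitively; `upos_counts` and the sample are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def upos_counts(root, form):
    """Count UPosTag values of all <w> tokens whose text equals form."""
    counts = Counter()
    for el in root.iter():
        if el.tag.rsplit("}", 1)[-1] != "w":
            continue
        if (el.text or "").strip() != form:
            continue
        # msd looks like "UPosTag=ADP|AdpType=Prep"; split it into a dict.
        feats = dict(p.split("=", 1) for p in (el.get("msd") or "").split("|") if "=" in p)
        counts[feats.get("UPosTag", "(none)")] += 1
    return counts


# Hypothetical sample; real counts require the full corpus files.
DEL_SAMPLE = """<s>
  <w msd="UPosTag=ADP|AdpType=Prep">del</w>
  <w msd="UPosTag=NOUN">del</w>
  <w msd="UPosTag=ADP">Del</w>
  <w msd="UPosTag=NOUN">segon</w>
</s>"""

print(upos_counts(ET.fromstring(DEL_SAMPLE), "del"))
```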

matyaskopp commented 9 months ago

annotated version: image

<seg xml:id="ParlaMint-ES-CT_2020-10-22-6402.4.0.2" xml:lang="ca">
<!-- ... -->
                  <s xml:id="ParlaMint-ES-CT_2020-10-22-6402.4.0.2.7" xml:lang="ca">
<!-- ... -->
<w xml:id="ParlaMint-ES-CT_2020-10-22-6402.4.0.2.7.3" msd="UPosTag=NUM" lemma="15/100" join="right">15_per_cent</w>
<!-- ... -->
<w xml:id="ParlaMint-ES-CT_2020-10-22-6402.4.0.2.7.14" msd="UPosTag=NUM" lemma="82/100">82_per_cent</w>
</seg>

I understand that you probably used the underscores to fix wrong tokenization, but you forgot to remove them from the token text.
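
Leftover underscores can be listed with a short scan such as the sketch below. It only reports `<w>` tokens whose text contains `_`; it does not guess what the original spelling was. The function name and the `a`/`b` identifiers are illustrative.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def tokens_with_underscores(root):
    """Return (xml:id, text) for every <w> whose text contains an underscore."""
    hits = []
    for el in root.iter():
        if el.tag.rsplit("}", 1)[-1] != "w":
            continue
        text = (el.text or "").strip()
        if "_" in text:
            hits.append((el.get(XML_ID), text))
    return hits


# Hypothetical sample modelled on the excerpt above.
UNDERSCORE_SAMPLE = """<s>
  <w xml:id="a" msd="UPosTag=NUM" lemma="15/100">15_per_cent</w>
  <w xml:id="b" msd="UPosTag=NOUN" lemma="segon">segon</w>
</s>"""

print(tokens_with_underscores(ET.fromstring(UNDERSCORE_SAMPLE)))
# prints: [('a', '15_per_cent')]
```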


Same issue with a different unit:

image