Linguistic Annotations - Githubissues

matyaskopp commented 1 year ago

I haven't run bin/ana_work_stanza.py, I have only checked ParlaMint 2.1 result ParlaMint.ana and the python source code. I can see the following problems:

The annotation script does not handle annotated notes (gap / vocal / kinesic / incident).
It removes all pb elements, so the output is a different file (the unannotated TEI file cannot be reconstructed from the TEI.ana file).

In ParlaMint.ana/ParlaMint-ES_2020-11-18-CD201118-bis.ana.xml, pb is preserved, but only as a result of incorrect XML parsing (the tag was tokenized as if it were text):

<w lemma="losipbar" msd="UPosTag=VERB|Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin" xml:id="ParlaMint-ES_2020-11-18-CD201118-bis.u2.1.2.15">los&lt;pb</w>
<w join="right" lemma="n=" msd="UPosTag=NOUN" xml:id="ParlaMint-ES_2020-11-18-CD201118-bis.u2.1.2.16">n=</w>
<pc join="right" msd="UPosTag=PUNCT|PunctType=Quot" xml:id="ParlaMint-ES_2020-11-18-CD201118-bis.u2.1.2.17">"</pc>
<w join="right" lemma="7" msd="UPosTag=NUM|NumForm=Digit|NumType=Card" xml:id="ParlaMint-ES_2020-11-18-CD201118-bis.u2.1.2.18">7</w>
<pc join="right" msd="UPosTag=PUNCT|PunctType=Quot" xml:id="ParlaMint-ES_2020-11-18-CD201118-bis.u2.1.2.19">"</pc>
<w lemma="&gt;&lt;/pb&gt;" msd="UPosTag=ADJ|Gender=Masc|Number=Plur" xml:id="ParlaMint-ES_2020-11-18-CD201118-bis.u2.1.2.20">&gt;&lt;/pb&gt;</w>

The annotation script produces wrong sentence segmentation when a note occurs inside a sentence, e.g. in this situation: https://github.com/calzada/PARLAMINT-ES-MC/blob/0b3a40e98797470f441ec5ad18bedfeb8fb35e3c/ParlaMint/ParlaMint-ES_2020-01-04-CD200104.xml#L560

                 <w lemma="el" msd="UPosTag=DET|Definite=Def|Gender=Fem|Number=Plur|PronType=Art" xml:id="ParlaMint-ES_2020-01-04-CD200104.u71.4.1.27">las</w>
                 <w lemma="gracias" msd="UPosTag=NOUN|Gender=Fem|Number=Plur" xml:id="ParlaMint-ES_2020-01-04-CD200104.u71.4.1.28">gracias</w>
                 <linkGrp targFunc="head argument" type="UD-SYN">
                    <!-- SKIPPING -->
                 </linkGrp>
              </s>
              <note>rumores</note>
              <s xml:id="ParlaMint-ES_2020-01-04-CD200104.u71.4.3">
                 <pc msd="UPosTag=PUNCT|PunctType=Comm" xml:id="ParlaMint-ES_2020-01-04-CD200104.u71.4.3.1">,</pc>
                 <w lemma="porque" msd="UPosTag=SCONJ" xml:id="ParlaMint-ES_2020-01-04-CD200104.u71.4.3.2">porque</w>
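The pb and note problems above share one cause: markup reaches the tokenizer as if it were text. A minimal sketch of one way to avoid that, assuming the text of each utterance sits in a `seg` element, is to parse the XML properly, send only the character data to the annotator, and record where each non-text element stood so it can be put back afterwards. The function names and the `MILESTONES` set are hypothetical; this is not the code of `bin/ana_work_stanza.py` or of the ParCzech scripts.

```python
import xml.etree.ElementTree as ET

# Elements that must survive annotation but must not be sent to the tagger.
# (Assumed set, taken from the element names mentioned in this issue.)
MILESTONES = {"pb", "note", "gap", "vocal", "kinesic", "incident"}


def local_name(tag):
    """Strip an XML namespace such as '{http://www.tei-c.org/ns/1.0}pb'."""
    return tag.rsplit("}", 1)[-1]


def split_text_and_milestones(seg):
    """Return (plain_text, milestones) for one <seg> element.

    milestones is a list of (char_offset, element) pairs; char_offset is the
    position in plain_text at which the element originally stood.
    """
    parts = []
    milestones = []
    length = 0

    def add(text):
        nonlocal length
        if text:
            parts.append(text)
            length += len(text)

    add(seg.text)
    for child in seg:
        if local_name(child.tag) in MILESTONES:
            # Remember the position; the element's own content is not annotated.
            milestones.append((length, child))
        else:
            # Any other inline element: keep its text content.
            add("".join(child.itertext()))
        # Text following the child still belongs to the running sentence.
        add(child.tail)
    return "".join(parts), milestones


if __name__ == "__main__":
    seg = ET.fromstring(
        '<seg>muchas gracias <note>rumores</note>, porque los <pb n="7"/>diputados</seg>'
    )
    text, marks = split_text_and_milestones(seg)
    print(repr(text))
    print([(offset, local_name(el.tag)) for offset, el in marks])
```

Because the note is recorded as an offset instead of splitting the text, the sentence splitter sees "gracias , porque" as continuous text, and the note can be reinserted between the two tokens whose character spans surround its offset.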

Fixing the annotation script would take a lot of work, so I suggest using the scripts from the ParCzech project. They use the LINDAT UDPipe service with the spanish-ancora-ud-2.10-220711 model and the NameTag annotation service, and they have been successfully reused for the ParlaMint-AT and ParlaMint-UA corpora, so I believe they will work for ParlaMint-ES too. My rough time estimate for the annotation is 1-2 days.
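For readers who want to try the service on a single paragraph before committing to the full pipeline, here is a minimal sketch of a request to the LINDAT UDPipe REST service and of reading its CoNLL-U output. The endpoint URL and the `model`, `tokenizer`, `tagger`, `parser` and `data` parameters are given as I understand the public service documentation and should be checked there; the helper names are hypothetical, and this is not how the ParCzech scripts are implemented.

```python
import json
import urllib.parse
import urllib.request

# Assumed public endpoint of the LINDAT UDPipe REST service.
UDPIPE_URL = "https://lindat.mff.cuni.cz/services/udpipe/api/process"
# Model name as given in this issue.
MODEL = "spanish-ancora-ud-2.10-220711"


def build_request(text, model=MODEL):
    """Build a POST request that asks for tokenization, tagging and parsing."""
    params = {"model": model, "tokenizer": "", "tagger": "", "parser": "", "data": text}
    body = urllib.parse.urlencode(params).encode("utf-8")
    return urllib.request.Request(UDPIPE_URL, data=body)


def annotate(text, model=MODEL):
    """Send text to the service and return its CoNLL-U output (needs network)."""
    with urllib.request.urlopen(build_request(text, model)) as response:
        return json.load(response)["result"]


def parse_conllu(conllu):
    """Return sentences as lists of (form, lemma, upos, feats) tuples."""
    sentences, current = [], []
    for line in conllu.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
        elif not line.startswith("#"):
            cols = line.split("\t")
            # Skip multiword-token ranges (e.g. "3-4") and empty nodes (e.g. "3.1").
            if cols[0].isdigit():
                current.append((cols[1], cols[2], cols[3], cols[5]))
    if current:
        sentences.append(current)
    return sentences


if __name__ == "__main__":
    for sentence in parse_conllu(annotate("Muchas gracias, porque los diputados votan.")):
        print(sentence)
```

The lemma, UPOS and feature columns map onto the `lemma` and `msd` attributes seen in the excerpts above; writing them back into TEI `w` and `pc` elements is a separate step not shown here.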

@calzada @lucianadmacedo, what do you think? Should I integrate this annotation into the Makefile and run it when the TEI version is ready?

lucianadmacedo commented 1 year ago

Hi @matyaskopp,

I think it would be worth a try. I'd annotate a sample file with UDPipe, as you mentioned, and see if the errors don't outweigh the ones generated by the current script. If they do, it would be better to implement those improvements into the already existing script. What do you think?

The annotation with the current script using stanza is taking an hour or so per file using my machine. So it'd be better to have something more efficient. Fingers crossed!

calzada commented 1 year ago

Hi @matyaskopp, @lucianadmacedo is our expert for annotation. But if you need anything, let us know. Best for now, mc

calzada commented 1 year ago

Hi @matyaskopp, @lucianadmacedo has managed to run the script https://github.com/calzada/PARLAMINT-ES-MC/blob/master/bin/ana_work_stanza.py. However, her GPU is not very powerful and the script takes forever. Would you be able to run the script yourself? I know this is very bad on our side, but we only have basic computer systems. What do you think? Best, mc

matyaskopp commented 1 year ago

I will annotate the sample with UDPipe and NameTag tomorrow.

calzada commented 1 year ago

Hi @matyaskopp, thanks so much for your help. Could you document the process (in a basic manner, since I am a total novice) so that I know how to do it myself at the next stage? A simple list of commands would be appreciated. I will try to replicate it, and if I cannot succeed, I will let you know. I know this is really a bore, but I will compensate in any way you may think of. At any rate, if you cannot document it, do not worry. The important thing is that the annotation is ready. After annotation, are there any further tasks to be performed? Best for now and thanks a zillion, mc

calzada commented 1 year ago

@lucianadmacedo That the script is working is absolutely fab. You are a star as usual. Nothing stops you. Let's see if UDPipe works. Alternatively, we will find a way to run your great script. Best for now, mc

matyaskopp commented 1 year ago

> The annotation with the current script using stanza is taking an hour or so per file using my machine. So it'd be better to have something more efficient. Fingers crossed!

@lucianadmacedo, it takes me about 2.5 minutes per file, with XML parsing on my laptop and annotation through the LINDAT service. I still have to implement the finalization script. I will then upload a sample to ParlaMint.ana.sample and update this pull request: https://github.com/clarin-eric/ParlaMint/pull/692

> After annotation are there any further tasks to be performed?

@calzada, yes, #35

calzada / PARLAMINT-ES-MC

Linguistic Annotations #33