clarin-eric / ParlaMint

ParlaMint: Comparable Parliamentary Corpora
https://clarin-eric.github.io/ParlaMint/

ES-GA Feedback #621

Closed matyaskopp closed 1 year ago

matyaskopp commented 1 year ago

corpus timespan

https://github.com/adina-v/ParlaMint/blob/308040600d3003048864710cd326255da0a415d5/Data/ParlaMint-ES-GA/ParlaMint-ES-GA.xml#L8-L9

<title type="sub" xml:lang="gl">Actas do Parlamento de Galicia, Lexislaturas IX-XI (2015 - 2021)</title>
<title type="sub" xml:lang="en">Minutes of the Galician Parliament, Terms 9-11 (2015 - 2021)</title>

versus:

            <bibl>
                <title type="main" xml:lang="gl">Actas do Parlamento de Galicia</title>
                <title type="main" xml:lang="en">Minutes of the Galician Parliament</title>
                <idno type="URI">https://www.parlamentodegalicia.gal/</idno>
                <date from="2015-01-27" to="2022-05-25">27.01.2015 - 25.05.2022</date>
            </bibl>
            <setting>
                <name type="address">Rúa do Hórreo, 63, 15702 Santiago de Compostela, A Coruña, Galicia, España</name>
                <name type="city">Santiago de Compostela</name>
                <name type="country" key="ES-GA">Galicia</name>
                <date from="2015-01-25" to="2022-05-25">25.01.2015 - 25.05.2022</date>
            </setting>
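
A mismatch like this one can be caught automatically. The following is only a minimal sketch, not part of the ParlaMint tooling: the function name `timespan_report` and the shortened `SAMPLE` header are made up for illustration, and it assumes that every `date` carrying `from`/`to` in the root file should state the same span and that the years in the sub-title should match it.

```python
import re
import xml.etree.ElementTree as ET


def local(tag):
    """Strip a '{namespace}' prefix so the check works with or without the TEI namespace."""
    return tag.rsplit("}", 1)[-1]


def timespan_report(xml_text):
    """Collect all from/to spans and the year ranges found in sub-titles (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    spans = set()
    title_years = set()
    for el in root.iter():
        name = local(el.tag)
        if name == "date" and "from" in el.attrib and "to" in el.attrib:
            spans.add((el.get("from"), el.get("to")))
        elif name == "title" and el.get("type") == "sub":
            years = re.findall(r"\b(?:19|20)\d{2}\b", el.text or "")
            if len(years) >= 2:
                title_years.add((years[0], years[-1]))
    # Reduce the ISO dates to years so they can be compared with the titles.
    date_years = {(start[:4], end[:4]) for start, end in spans}
    return {
        "date_spans": sorted(spans),
        "title_years": sorted(title_years),
        "consistent": len(spans) <= 1 and title_years <= date_years,
    }


# Shortened header built from the lines quoted above (wrapper element is illustrative).
SAMPLE = """<teiHeader>
  <title type="sub" xml:lang="en">Minutes of the Galician Parliament, Terms 9-11 (2015 - 2021)</title>
  <bibl><date from="2015-01-27" to="2022-05-25">27.01.2015 - 25.05.2022</date></bibl>
  <setting><date from="2015-01-25" to="2022-05-25">25.01.2015 - 25.05.2022</date></setting>
</teiHeader>"""

report = timespan_report(SAMPLE)
print(report)
```

For the quoted header the report lists two different date spans and a title range ending in 2021, so `consistent` is false.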

funder

https://github.com/adina-v/ParlaMint/blob/308040600d3003048864710cd326255da0a415d5/Data/ParlaMint-ES-GA/ParlaMint-ES-GA.xml#L63 see: https://github.com/clarin-eric/ParlaMint/blob/031ec3009386a4bfec60bf0e22f653a813ddf98c/Data/ParlaMint-CZ/ParlaMint-CZ.xml#L17-L20

missing speaker notes

https://github.com/adina-v/ParlaMint/blob/308040600d3003048864710cd326255da0a415d5/Data/ParlaMint-ES-GA/ParlaMint-ES-GA_2022-04-04-DSPG076.xml#L84-L90

<u who="#SantalicesMiguelÁngel" ana="#chair" xml:id="ParlaMint-ES-GA_2022-04-04-DSPG076.u1">
  <seg xml:id="ParlaMint-ES-GA_2022-04-04-DSPG076.seg1">Bos días. </seg>

speaker note is missing:

<note type="speaker">O señor PRESIDENTE:</note>
<u who="#SantalicesMiguelÁngel" ana="#chair" xml:id="ParlaMint-ES-GA_2022-04-04-DSPG076.u1">
  <seg xml:id="ParlaMint-ES-GA_2022-04-04-DSPG076.seg1">Bos días. </seg>

source: https://www.parlamentodegalicia.es/sitios/web/BibliotecaDiarioSesions/D110076.pdf

notes inside sentences

this note should not split sentence/paragraph/utterance: https://github.com/adina-v/ParlaMint/blob/308040600d3003048864710cd326255da0a415d5/Data/ParlaMint-ES-GA/ParlaMint-ES-GA_2022-04-04-DSPG076.xml#L109-L113

    <seg xml:id="ParlaMint-ES-GA_2022-04-04-DSPG076.seg21">Miren, en termos globais, Galicia é das comunidades autónomas de España que está na metade inferior no seu gasto en investimento por habitante en sanidade. O investimento por persoa que temos en Galicia está na franxa inferior. Podemos reinterpretar este dato como cada un queira, pero esta é a realidade, hai máis comunidades, algunhas, bastantes máis, por riba que por debaixo. Permítanme que llelo ensine nun gráfico</seg>
</u>
<note>(O señor Torrado amosa un gráfico.)</note>
<u who="#TorradoJulio" ana="#regular" xml:id="ParlaMint-ES-GA_2022-04-04-DSPG076.u4">
    <seg xml:id="ParlaMint-ES-GA_2022-04-04-DSPG076.seg22">e, ademais, permítanme que o faga porque hai un detalle aí que me parece moi interesante. Este é o gráfico de investimento por habitante para os orzamentos de 2022 en España. Vostedes poden ver as máis altas, están aquí sinaladas en verde: Euskadi, Asturias e Navarra; e poden ver Galicia na metade inferior, e poden ver as últimas tres.</seg>

Both parts should be in a single seg.

It seems that whenever there is a note inside an utterance, you automatically split that utterance in two.

strange speech content

https://github.com/adina-v/ParlaMint/blob/308040600d3003048864710cd326255da0a415d5/Data/ParlaMint-ES-GA/ParlaMint-ES-GA_2015-12-23-DSPG136.xml#L121-L122

<seg xml:id="ParlaMint-ES-GA_2015-12-23-DSPG136.seg23">PAGE    </seg>
<seg xml:id="ParlaMint-ES-GA_2015-12-23-DSPG136.seg24">PAGE   2</seg>
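
Segments like these could be listed with a small script. This is a rough sketch under the assumption that the residue always has the form "PAGE" optionally followed by a number; the helper name `find_page_residue` and the third segment in `SAMPLE` are hypothetical and only there for contrast.

```python
import re
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
# Assumed shape of the residue: the word PAGE, optional digits, nothing else.
PAGE_FIELD = re.compile(r"^\s*PAGE\s*\d*\s*$")


def find_page_residue(xml_text):
    """Return the xml:id of every seg whose whole text is a PAGE leftover."""
    root = ET.fromstring(xml_text)
    hits = []
    for el in root.iter():
        if el.tag.rsplit("}", 1)[-1] != "seg":
            continue
        if PAGE_FIELD.match("".join(el.itertext())):
            hits.append(el.get(XML_ID))
    return hits


SAMPLE = """<u>
<seg xml:id="ParlaMint-ES-GA_2015-12-23-DSPG136.seg23">PAGE    </seg>
<seg xml:id="ParlaMint-ES-GA_2015-12-23-DSPG136.seg24">PAGE   2</seg>
<seg xml:id="example.seg3">Bos días.</seg><!-- hypothetical segment with real speech -->
</u>"""

residue = find_page_residue(SAMPLE)
print(residue)
```

Only the first two segments are reported; the one with real speech is left alone.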

passive should be government in opposition relation

https://github.com/adina-v/ParlaMint/blob/308040600d3003048864710cd326255da0a415d5/Data/ParlaMint-ES-GA/ParlaMint-ES-GA-listOrg.xml#L558-L563

        <relation name="opposition" 
                  active="#party.AGE #party.PSdeG-PSOE #party.BNG" 
                  passive="#party.PPdeG" 
                  from="2012-11-16" 
                  to="2016-08-01" 
                  ana="#PG.9"/>

should be:

        <relation name="opposition" 
                  active="#party.AGE #party.PSdeG-PSOE #party.BNG" 
                  passive="GOV" 
                  from="2012-11-16" 
                  to="2016-08-01" 
                  ana="#PG.9"/>
adina-v commented 1 year ago

Thank you for the feedback and the keen eye, @matyaskopp! One question: We were trying to avoid having notes inside segments and utterances as per (our understanding of) the TEI schema. So, for the note to be non-splitting, then it should be something like this?

<seg xml:id="ParlaMint-ES-GA_2022-04-04-DSPG076.seg21">Miren, en termos globais, Galicia é das comunidades autónomas de España que está na metade inferior no seu gasto en investimento por habitante en sanidade. O investimento por persoa que temos en Galicia está na franxa inferior. Podemos reinterpretar este dato como cada un queira, pero esta é a realidade, hai máis comunidades, algunhas, bastantes máis, por riba que por debaixo. Permítanme que llelo ensine nun gráfico
<note>(O señor Torrado amosa un gráfico.)</note>
e, ademais, permítanme que o faga porque hai un detalle aí que me parece moi interesante. Este é o gráfico de investimento por habitante para os orzamentos de 2022 en España. Vostedes poden ver as máis altas, están aquí sinaladas en verde: Euskadi, Asturias e Navarra; e poden ver Galicia na metade inferior, e poden ver as últimas tres.</seg>

Or can the note only be inside the "u" but dividing the "seg" like here?

matyaskopp commented 1 year ago

If the note (short note) is in the middle of a paragraph, then you should keep it there. (but end and beginning notes should be outside of paragraph)

<seg>If the note <note>(short note)</note> is in the middle of a paragraph, then you should keep it there.</seg>
<note>(but end and beginning notes should be outside of paragraph)</note>

and if the note is first/last in the utterance, then it should be outside too.

Your sample is almost correct. It is better to put a space around the note, which is safer for further linguistic annotation: a double space is better than a double newline, which can disturb sentence segmentation.

<seg>Miren, <!--...--> gráfico <note>(O señor Torrado amosa un gráfico.)</note> e, ademais, <!--...--> tres.</seg>
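
The placement rule (notes in the middle of a paragraph stay inside the seg, notes at its very start or end go outside) can be checked mechanically. This is only a sketch: `misplaced_notes` is a hypothetical helper, the set of element names treated like notes is an assumption, and only the first and last child of each seg are examined.

```python
import xml.etree.ElementTree as ET

# Assumption: these elements are all treated like notes for placement purposes.
INCIDENTS = {"note", "vocal", "kinesic", "incident"}


def local(tag):
    """Strip a '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def misplaced_notes(xml_text):
    """Report notes/incidents that open or close a seg instead of sitting mid-text."""
    problems = []
    for seg in ET.fromstring(xml_text).iter():
        if local(seg.tag) != "seg":
            continue
        children = list(seg)
        for i, child in enumerate(children):
            if local(child.tag) not in INCIDENTS:
                continue
            content = "".join(child.itertext()).strip()
            # No speech text before the first child: the note opens the seg.
            if i == 0 and not (seg.text or "").strip():
                problems.append(("leading", content))
            # No speech text after the last child: the note closes the seg.
            if i == len(children) - 1 and not (child.tail or "").strip():
                problems.append(("trailing", content))
    return problems


SAMPLE = """<u>
  <seg>Permítanme que llelo ensine nun gráfico <note>(O señor Torrado amosa un gráfico.)</note> e, ademais, permítanme que o faga.</seg>
  <seg>Moitas grazas. <kinesic type="applause"><desc>(Aplausos.)</desc></kinesic></seg>
</u>"""

found = misplaced_notes(SAMPLE)
print(found)
```

The mid-paragraph note in the first seg passes, while the applause that closes the second seg is reported and would need to move outside the seg.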
matyaskopp commented 1 year ago

@adina-v one more improvement idea:

proper link to source file

Currently, all component bibl/idno elements carry nearly the same URL. If possible, it would be great for each component file to refer to its own source document, as this serves as a kind of proof that the data are real. See the CZ corpus, where I even add links to u and pb elements:
https://github.com/clarin-eric/ParlaMint/blob/031ec3009386a4bfec60bf0e22f653a813ddf98c/Data/ParlaMint-CZ/ParlaMint-CZ_2016-04-13-ps2013-044-02-013-114.xml#L54-L59
https://github.com/clarin-eric/ParlaMint/blob/031ec3009386a4bfec60bf0e22f653a813ddf98c/Data/ParlaMint-CZ/ParlaMint-CZ_2016-04-13-ps2013-044-02-013-114.xml#L161-L164
https://github.com/clarin-eric/ParlaMint/blob/031ec3009386a4bfec60bf0e22f653a813ddf98c/Data/ParlaMint-CZ/ParlaMint-CZ_2016-04-13-ps2013-044-02-013-114.xml#L174-L177

adina-v commented 1 year ago

Thank you! We'll do what we can :)

adina-v commented 1 year ago

Sorry, @matyaskopp, we are still having doubts about the placement of notes. By "short note" inside a paragraph, do you mean all kinds of notes, or only those marked as <note>? What about notes that require a description, such as <vocal> or <kinesic>? So if we have a paragraph such as the following, would that be correct?

<u who="#PradoMaríaMontserrat" ana="#regular" xml:id="ParlaMint-ES-GA_2022-04-04-DSPG076.u11">
[...]
<seg xml:id="ParlaMint-ES-GA_2022-04-04-DSPG076.seg52">E podemos falar dos PAC, [...]</seg>
<kinesic type="applause">
<desc>(Aplausos.)</desc>
</kinesic>
<seg xml:id="ParlaMint-ES-GA_2022-04-04-DSPG076.seg53">Logo esa é a atención que vostedes lle están dando á poboación,</seg>
<kinesic type="applause">
<desc>(Aplausos.)</desc>
</kinesic>
<seg xml:id="ParlaMint-ES-GA_2022-04-04-DSPG076.seg54">¡Esa é atención que lle están dando á poboación!</seg> 
<kinesic type="applause">
<desc>(Aplausos.)</desc>
</kinesic>
<seg xml:id="ParlaMint-ES-GA_2022-04-04-DSPG076.seg55">E, polo tanto, evidentemente, hai que tomar en serio a defensa da sanidade pública [...]</seg>
[...]
</u>

And the following?

    <note>(As señoras deputadas e os señores deputados e demais asistentes a esta sesión, postos en pé, gardan un minuto de silencio.)</note>
    <u who="#SantalicesMiguelÁngel" ana="#chair" xml:id="ParlaMint-ES-GA_2020-08-07-DSPG001.u165">
        <seg xml:id="ParlaMint-ES-GA_2020-08-07-DSPG001.seg241">E agora si, escoitamos o himno.</seg>
        <seg xml:id="ParlaMint-ES-GA_2020-08-07-DSPG001.seg242">Moitas grazas.</seg>
    </u>
    <note>(As señoras deputadas e os señores deputados e demais asistentes á sesión, postos en pé, cantan o himno galego.)</note>
        <kinesic type="applause"> 
            <desc>(Aplausos.)</desc> 
        </kinesic>
        <seg xml:id="ParlaMint-ES-GA_2020-08-07-DSPG001.seg243">Moitas grazas. Quedou constituído o Parlamento. Grazas a todos.</seg>
    </u>
        <kinesic type="applause"> 
            <desc>(Aplausos.)</desc> 
        </kinesic>
    <note type="time">Remata a sesión ás doce e trinta e oito minutos do mediodía.</note>

Thank you!

matyaskopp commented 1 year ago

I hope this summarizes the idea:

adina-v commented 1 year ago

Thank you!

adina-v commented 1 year ago
  • you should preserve paragraphs as they are in the text - so if there is a note/incident in the middle of a paragraph, then you should place it in the very same place

Sorry, @matyaskopp, we are still debating on note placement and could really use your expertise. Could you please shed some light once and for all on this issue? Would it be acceptable to separate notes in the middle of paragraphs by new lines, like so,

    <u who="#IglesiasCarmen" ana="#regular" xml:id="ParlaMint-ES-GA_2015-01-27-DSPG095.u67">
        <seg xml:id="ParlaMint-ES-GA_2015-01-27-DSPG095.seg494">
            ...e das poucas disimuladas intencións de culpabilizar as pacientes 
            <vocal type="murmuring">
                <desc>(Murmurios.)</desc>
            </vocal>
            pois parece que enfermamos por riba das nosas posibilidades...</seg>
    </u>

Or must such notes be placed on the same line as the rest of the text of the <seg> they belong to, like so:

<note type="speaker">A señora IGLESIAS SUEIRO:</note>
    <u who="#IglesiasCarmen" ana="#regular" xml:id="ParlaMint-ES-GA_2015-01-27-DSPG095.u67">
           <seg xml:id="ParlaMint-ES-GA_2015-01-27-DSPG095.seg494">
        ...e das poucas disimuladas intencións de culpabilizar as pacientes  <vocal type="murmuring"> <desc>(Murmurios.)</desc> </vocal>  , pois parece que enfermamos por riba das nosas posibilidades...</seg>
    </u>

Thank you in advance for any feedback on this!

matyaskopp commented 1 year ago

The best solution is this (preserve spaces around incident/note that are in text):

    <u who="#IglesiasCarmen" ana="#regular" xml:id="ParlaMint-ES-GA_2015-01-27-DSPG095.u67">
        <seg xml:id="ParlaMint-ES-GA_2015-01-27-DSPG095.seg494"><!-- NO LEADING SPACE-->...e das poucas disimuladas intencións de culpabilizar as pacientes <vocal type="murmuring">
                <desc>(Murmurios.)</desc>
            </vocal>, pois parece que enfermamos por riba das nosas posibilidades...<!-- NO TRAILING SPACE --></seg>
    </u>

It is also the safest option for sentence segmentation. I don't know which tool you are using for tokenization, but with UDPipe, for example, a sentence that contains a double newline can be split into two sentences.

see UA sample: https://github.com/clarin-eric/ParlaMint/blob/197e5ecf057a5ed53db6375421d78ffaf4e1c45c/Data/ParlaMint-UA/ParlaMint-UA_2014-12-02-m0.xml#L123-L125
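
If the conversion scripts cannot easily avoid the newlines, one option is to normalize the whitespace of each seg afterwards. This is only a sketch of that idea, not something ParlaMint prescribes: `normalize_seg` is a hypothetical helper, and it assumes that whitespace-only text inside an incident element is just indentation and can be dropped.

```python
import re
import xml.etree.ElementTree as ET

WS = re.compile(r"\s+")


def normalize_seg(seg):
    """Collapse whitespace runs (including newlines) in the speech text of one seg, in place."""
    if seg.text:
        # Single spaces only, and no leading space at the start of the seg.
        seg.text = WS.sub(" ", seg.text).lstrip()
    children = list(seg)
    for child in children:
        # Drop indentation-only whitespace inside the note/incident itself.
        for inner in child.iter():
            if inner.text is not None and not inner.text.strip():
                inner.text = None
            if inner is not child and inner.tail is not None and not inner.tail.strip():
                inner.tail = None
        if child.tail:
            child.tail = WS.sub(" ", child.tail)
    # No trailing space at the end of the seg.
    if children:
        if children[-1].tail:
            children[-1].tail = children[-1].tail.rstrip()
    elif seg.text:
        seg.text = seg.text.rstrip()
    return seg


SAMPLE = """<seg>
            ...e das poucas disimuladas intencións de culpabilizar as pacientes
            <vocal type="murmuring">
                <desc>(Murmurios.)</desc>
            </vocal>
            pois parece que enfermamos por riba das nosas posibilidades...</seg>"""

cleaned = ET.tostring(normalize_seg(ET.fromstring(SAMPLE)), encoding="unicode")
print(cleaned)
```

The result keeps a single space on each side of the incident and contains no newlines, so a tokenizer should not see a paragraph break in the middle of the sentence.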

adina-v commented 1 year ago

Thank you, @matyaskopp! We have updated the sample, fixing the issues in the XML files. The .ana files are still in progress, so we'll update those when we have them.

matyaskopp commented 1 year ago

@adina-v, great thanks. You still have newlines around notes - I am not insisting on fixing this, but you should be careful in linguistic annotations - sentence segmentation and tokenization can be broken by it.

adina-v commented 1 year ago

Sorry - we are aware of that. We looked into it, but with the way we had our scripts set up and the limited time available, it was very complicated to implement - so we decided to leave the newlines in place and be ready to fix this issue at the level of linguistic annotation.

adina-v commented 1 year ago

Hi! We just updated the sample with the ana files - hope all is correct!

matyaskopp commented 1 year ago

@adina-v, great, thanks. Reported issues are fixed. There is always space for improvement, but it is time to stop :-). But I have spotted one final issue that affects only root files and can cause confusion in the concordancer.

language usage

should look like this: https://clarin-eric.github.io/ParlaMint/#exa-langUsage

But you are missing a Galician translation, and an English translation is wrongly set as Galician: https://github.com/adina-v/ParlaMint/blob/1e4ebbe388f362d2ce002c7bef95b34fb6810204/Data/ParlaMint-ES-GA/ParlaMint-ES-GA.xml#L192-L193

<langUsage>
    <language ident="gl" xml:lang="gl">Galician</language>
    <language ident="en" xml:lang="en">English</language>
</langUsage>

should be

<langUsage>
    <language ident="gl" xml:lang="gl"><!-- ??? --></language>
    <language ident="en" xml:lang="gl"><!-- ??? --></language>
    <language ident="gl" xml:lang="en">Galician</language>
    <language ident="en" xml:lang="en">English</language>
</langUsage>
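
A quick way to spot missing entries is to compare the `ident`/`xml:lang` pairs that are present with all the pairs that are expected. The sketch below assumes that every listed language should be named in every listed language (the full cross product, as in the corrected example above); `missing_language_names` is a hypothetical helper. It only checks that the pairs exist, so it cannot tell whether a name is really written in the language its `xml:lang` claims.

```python
from itertools import product
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def missing_language_names(xml_text):
    """Return (ident, xml:lang) pairs that have no language element."""
    root = ET.fromstring(xml_text)
    seen = {
        (el.get("ident"), el.get(XML_LANG))
        for el in root.iter()
        if el.tag.rsplit("}", 1)[-1] == "language"
    }
    idents = sorted({ident for ident, _ in seen})
    # Assumption: each language must be named once in each of the listed languages.
    return [pair for pair in product(idents, idents) if pair not in seen]


# The langUsage block as it currently stands in the root file.
SAMPLE = """<langUsage>
    <language ident="gl" xml:lang="gl">Galician</language>
    <language ident="en" xml:lang="en">English</language>
</langUsage>"""

missing = missing_language_names(SAMPLE)
print(missing)
```

For the current block this reports the two missing pairs, the Galician name of English and the English name of Galician.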
adina-v commented 1 year ago

Thank you for the observation and all the help so far! We have updated the TEI root and TEI ana root.

matyaskopp commented 1 year ago

@adina-v, thanks for the quick fixes.

The Galician sample works for me, so if you agree, I can merge it into the data branch and close this issue, and you can jump into processing the whole corpus.

adina-v commented 1 year ago

Great, thank you!