Closed matyaskopp closed 1 year ago
Thank you for the feedback and the keen eye, @matyaskopp! One question: We were trying to avoid having notes inside segments and utterances as per (our understanding of) the TEI schema. So, for the note to be non-splitting, then it should be something like this?
<seg xml:id="ParlaMint-ES-GA_2022-04-04-DSPG076.seg21">Miren, en termos globais, Galicia é das comunidades autónomas de España que está na metade inferior no seu gasto en investimento por habitante en sanidade. O investimento por persoa que temos en Galicia está na franxa inferior. Podemos reinterpretar este dato como cada un queira, pero esta é a realidade, hai máis comunidades, algunhas, bastantes máis, por riba que por debaixo. Permítanme que llelo ensine nun gráfico
<note>(O señor Torrado amosa un gráfico.)</note>
e, ademais, permítanme que o faga porque hai un detalle aí que me parece moi interesante. Este é o gráfico de investimento por habitante para os orzamentos de 2022 en España. Vostedes poden ver as máis altas, están aquí sinaladas en verde: Euskadi, Asturias e Navarra; e poden ver Galicia na metade inferior, e poden ver as últimas tres.</seg>
Or can the note only be inside the "u" but dividing the "seg" like here?
If the note (short note) is in the middle of a paragraph, then you should keep it there. (but end and beginning notes should be outside of paragraph)
<seg>If the note <note>(short note)</note> is in the middle of a paragraph, then you should keep it there.</seg>
<note>(but end and beginning notes should be outside of paragraph)</note>
and if the note is first/last in the utterance, then it should be outside too.
Your sample is almost correct. It is better to put a space on each side of the note, which is safer for further linguistic annotation: for sentence segmentation, a double space is better than a double newline.
<seg>Miren, <!--...--> gráfico <note>(O señor Torrado amosa un gráfico.)</note> e, ademais, <!--...--> tres.</seg>
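The placement rule above (notes in the middle of a paragraph stay inside the seg, notes at its beginning or end go outside) could be checked automatically. This is only a minimal sketch: the function name `edge_notes` and the `NOTE_LIKE` set are hypothetical, and it assumes the elements discussed in this thread (note, vocal, kinesic, incident) are the ones to look for.

```python
import xml.etree.ElementTree as ET

# Element names treated as "notes" here; an assumption based on this thread.
NOTE_LIKE = {"note", "vocal", "kinesic", "incident"}


def local(tag):
    """Strip an optional '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def edge_notes(seg):
    """Return note-like children that sit at the very start or end of a seg."""
    children = list(seg)
    found = []
    if not children:
        return found
    first, last = children[0], children[-1]
    # A note is "first" when no real text precedes it inside the seg.
    if local(first.tag) in NOTE_LIKE and not (seg.text or "").strip():
        found.append(first)
    # A note is "last" when no real text follows it inside the seg.
    if local(last.tag) in NOTE_LIKE and not (last.tail or "").strip():
        if last is not first or not found:
            found.append(last)
    return found


middle = ET.fromstring("<seg>If the note <note>(short note)</note> is in the middle.</seg>")
ending = ET.fromstring("<seg>Some text. <note>(end note)</note></seg>")
print(len(edge_notes(middle)), len(edge_notes(ending)))
```

Anything the function returns is a candidate for moving outside the seg; a note with text on both sides is left alone.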
@adina-v one more improvement idea:
Currently, you have a similar URL in all component bibl/idno elements. If it is possible, it would be great to refer to each component file's proper source. It is a kind of proof that the data are real.
Check the CZ corpus, I am even adding links to u and pb.
Thank you! We'll do what we can :)
Sorry, @matyaskopp, we are still having doubts about the placement of notes. By "short note" inside a paragraph, do you mean all kinds of notes, or only those marked as <note>? What about notes that require a description, such as <vocal>? So if we have a paragraph such as the following, would that be correct?
<u who="#PradoMaríaMontserrat" ana="#regular" xml:id="ParlaMint-ES-GA_2022-04-04-DSPG076.u11">
[...]
<seg xml:id="ParlaMint-ES-GA_2022-04-04-DSPG076.seg52">E podemos falar dos PAC, [...]</seg>
<kinesic type="applause"><desc>(Aplausos.)</desc></kinesic>
<seg xml:id="ParlaMint-ES-GA_2022-04-04-DSPG076.seg53">Logo esa é a atención que vostedes lle están dando á poboación,</seg>
<kinesic type="applause"><desc>(Aplausos.)</desc></kinesic>
<seg xml:id="ParlaMint-ES-GA_2022-04-04-DSPG076.seg54">¡Esa é atención que lle están dando á poboación!</seg>
<kinesic type="applause"><desc>(Aplausos.)</desc></kinesic>
<seg xml:id="ParlaMint-ES-GA_2022-04-04-DSPG076.seg55">E, polo tanto, evidentemente, hai que tomar en serio a defensa da sanidade pública [...]</seg>
[...]
</u>
And the following?
<note>(As señoras deputadas e os señores deputados e demais asistentes a esta sesión, postos en pé, gardan un minuto de silencio.)</note>
<u who="#SantalicesMiguelÁngel" ana="#chair" xml:id="ParlaMint-ES-GA_2020-08-07-DSPG001.u165">
<seg xml:id="ParlaMint-ES-GA_2020-08-07-DSPG001.seg241">E agora si, escoitamos o himno.</seg>
<seg xml:id="ParlaMint-ES-GA_2020-08-07-DSPG001.seg242">Moitas grazas.</seg>
<note>(As señoras deputadas e os señores deputados e demais asistentes á sesión, postos en pé, cantan o himno galego.)</note>
<kinesic type="applause">
<seg xml:id="ParlaMint-ES-GA_2020-08-07-DSPG001.seg243">Moitas grazas. Quedou constituído o Parlamento. Grazas a todos.</seg>
<kinesic type="applause">
<note type="time">Remata a sesión ás doce e trinta e oito minutos do mediodía.</note>
Thank you!
I hope this summarizes the idea:
- <note>, <incident>, <kinesic> should be handled in a similar way
- you should preserve paragraphs (<seg>) as they are in the text, so if there is a note/incident in the middle of a paragraph, then you should place it in the very same place
- if it is a long interruption such as a break or the singing of a hymn, then it is better to place it outside <seg> (or even outside the utterance <u>)
- it is not very strict, every partner does it as they want; documented here: <u>
Sorry, @matyaskopp, we are still debating on note placement and could really use your expertise. Could you please shed some light once and for all on this issue? Would it be acceptable to separate notes in the middle of paragraphs by new lines, like so,
<u who="#IglesiasCarmen" ana="#regular" xml:id="ParlaMint-ES-GA_2015-01-27-DSPG095.u67">
<seg xml:id="ParlaMint-ES-GA_2015-01-27-DSPG095.seg494">
...e das poucas disimuladas intencións de culpabilizar as pacientes
<vocal type="murmuring"><desc>(Murmurios.)</desc></vocal>
pois parece que enfermamos por riba das nosas posibilidades...</seg>
Or must such notes be placed on the same line as all the text of the <seg> they belong to, like so:
<note type="speaker">A señora IGLESIAS SUEIRO:</note>
<u who="#IglesiasCarmen" ana="#regular" xml:id="ParlaMint-ES-GA_2015-01-27-DSPG095.u67">
<seg xml:id="ParlaMint-ES-GA_2015-01-27-DSPG095.seg494">
...e das poucas disimuladas intencións de culpabilizar as pacientes <vocal type="murmuring"> <desc>(Murmurios.)</desc> </vocal> , pois parece que enfermamos por riba das nosas posibilidades...</seg>
Thank you in advance for any feedback on this!
The best solution is this (preserve spaces around incident/note that are in text):
<u who="#IglesiasCarmen" ana="#regular" xml:id="ParlaMint-ES-GA_2015-01-27-DSPG095.u67">
<seg xml:id="ParlaMint-ES-GA_2015-01-27-DSPG095.seg494"><!-- NO LEADING SPACE-->...e das poucas disimuladas intencións de culpabilizar as pacientes <vocal type="murmuring"><desc>(Murmurios.)</desc></vocal>, pois parece que enfermamos por riba das nosas posibilidades...<!-- NO TRAILING SPACE --></seg>
It is also the safest option for sentence segmentation. I don't know what tool you are using for tokenization, but UDPipe, for example, can split a sentence into two when it contains a double newline.
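For anyone who cannot change how the files are generated, the whitespace could also be cleaned up just before linguistic annotation. The following is a rough sketch, not part of any ParlaMint tooling: `normalize_seg` and `normalize_tree` are hypothetical names, and it only touches the text directly inside a seg and the text following its direct children, leaving the content of the notes themselves alone.

```python
import re
import xml.etree.ElementTree as ET

WS = re.compile(r"\s+")


def _squash(text):
    """Collapse any whitespace run (newlines included) into one space."""
    return WS.sub(" ", text) if text else text


def normalize_seg(seg):
    """Normalize whitespace around notes inside one seg element, in place."""
    children = list(seg)
    seg.text = _squash(seg.text)
    for child in children:
        child.tail = _squash(child.tail)
    # No leading space at the start of the segment.
    if seg.text:
        seg.text = seg.text.lstrip()
    # No trailing space at the end of the segment.
    if children:
        if children[-1].tail:
            children[-1].tail = children[-1].tail.rstrip()
    elif seg.text:
        seg.text = seg.text.rstrip()
    return seg


def normalize_tree(root):
    """Apply normalize_seg to every seg element, whatever its namespace."""
    for el in root.iter():
        if el.tag.rsplit("}", 1)[-1] == "seg":
            normalize_seg(el)
    return root


sample = (
    "<seg>\n...as pacientes\n"
    '<vocal type="murmuring"><desc>(Murmurios.)</desc></vocal>\n'
    ", pois parece...\n</seg>"
)
print(ET.tostring(normalize_seg(ET.fromstring(sample)), encoding="unicode"))
```

Each newline around the note becomes a single space, so the note stays in place but no longer looks like a sentence or paragraph break to a tokenizer.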
Thank you, @matyaskopp! We have updated the sample fixing the issues in the XML files. The .ana files are still in progress, so we'll be updating those when we have them.
@adina-v, great thanks. You still have newlines around notes - I am not insisting on fixing this, but you should be careful in linguistic annotations - sentence segmentation and tokenization can be broken by it.
Sorry - we are aware of that. We looked into it, but with the way we had our scripts set up and the limited time available, it was very complicated to implement - so we decided to leave the newlines in place and be ready to fix this issue at the level of linguistic annotation.
Hi! We just updated the sample with the ana files - hope all is correct!
@adina-v, great, thanks. Reported issues are fixed. There is always space for improvement, but it is time to stop :-). But I have spotted one final issue that affects only root files and can cause confusion in the concordancer.
langUsage in the TEI root and in the TEI.ana root needs a fix: you are missing a Galician translation, and an English translation is wrongly set as Galician:
<language ident="gl" xml:lang="gl">Galician</language>
<language ident="en" xml:lang="en">English</language>
should be
<language ident="gl" xml:lang="gl"><!-- ??? --></language>
<language ident="en" xml:lang="gl"><!-- ??? --></language>
<language ident="gl" xml:lang="en">Galician</language>
<language ident="en" xml:lang="en">English</language>
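A quick way to catch this kind of gap is to check that every language listed has a name in each of the two languages. This is a minimal sketch under assumptions: the function name `missing_language_names` and the `ui_langs` parameter are hypothetical, and the check only looks for missing or empty entries; it cannot tell whether a name is written in the language that xml:lang claims.

```python
import xml.etree.ElementTree as ET

# ElementTree spelling of the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def missing_language_names(lang_usage, ui_langs=("gl", "en")):
    """Return (ident, xml:lang) pairs that lack a non-empty language name."""
    seen = set()
    idents = []
    for el in lang_usage.iter():
        if el.tag.rsplit("}", 1)[-1] != "language":
            continue
        ident = el.get("ident")
        if ident not in idents:
            idents.append(ident)
        # Entries with empty text (or only a comment) do not count as a name.
        if (el.text or "").strip():
            seen.add((ident, el.get(XML_LANG)))
    return [(i, lang) for i in idents for lang in ui_langs if (i, lang) not in seen]


reported = ET.fromstring(
    "<langUsage>"
    '<language ident="gl" xml:lang="gl">Galician</language>'
    '<language ident="en" xml:lang="en">English</language>'
    "</langUsage>"
)
print(missing_language_names(reported))  # [('gl', 'en'), ('en', 'gl')]
```

On the reported version it flags the two missing combinations; once all four entries are present and filled in, it returns an empty list.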
Thank you for the observation and all the help so far! We have updated the TEI root and TEI ana root.
@adina-v, thanks for the quick fixes.
The Galician sample works for me, so if you agree, I can merge it into the data branch and close this issue, and you can jump into processing the whole corpus.
Great, thank you!
corpus timespan
funder see:
missing speaker notes
speaker note is missing:
notes inside sentences
this note should not split sentence/paragraph/utterance:
should be in one
If there is a note inside utterance, you automatically split this utterance into two
strange speech content
passive should be government in opposition relation
should be: