clarin-eric / ParlaMint

ParlaMint: Comparable Parliamentary Corpora
https://clarin-eric.github.io/ParlaMint/

BA+HR+RS feedback (TEI version) #536

Closed matyaskopp closed 1 year ago

matyaskopp commented 1 year ago

term in meeting

https://github.com/5roop/ParlaMint/blob/0a4f83ce4bb79ee297aa0c74f687cd0a95a68025/Data/ParlaMint-HR/ParlaMint-HR.xml#L15 this should be removed

<meeting n="10" corresp="#HS" ana="#parla.term #HS.10">10. mandat</meeting>

There are no proceedings from the 10th term: it starts after the last date in the corpus, 2022-07-15.

component file contains sitting

<TEI xmlns="http://www.tei-c.org/ns/1.0" xml:lang="bs" xml:id="ParlaMint-BA_1999-02-10-0" ana="#parla.term #reference">

should be

<TEI xmlns="http://www.tei-c.org/ns/1.0" xml:lang="bs" xml:id="ParlaMint-BA_1999-02-10-0" ana="#parla.sitting #reference">

meeting element in the component file

The <meeting> element in the component files should be structured. https://github.com/5roop/ParlaMint/blob/d1185f4274bb6bd5c94efac73ddff3d069316aa7/Data/ParlaMint-BA/ParlaMint-BA_1999-02-10-0.xml#L10

<meeting n="T02S04" corresp="#PS" ana="#parla.term #PS.2">2. mandat, 04. sjednica</meeting>

should be

<meeting n="2" corresp="#PS" ana="#parla.uni #parla.term #PS.2">2. mandat</meeting>
<meeting n="4" corresp="#PS" ana="#parla.uni #parla.session">04. sjednica</meeting> <!-- or #parla.meeting -->
<meeting n="1999-02-10" corresp="#PS" ana="#parla.uni #parla.sitting">1999-02-10</meeting>
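For a pipeline that already carries the combined value, the restructuring above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the function name `split_meeting` is hypothetical, the old `n` value is assumed to always have the form `T<term>S<session>`, the element text is assumed to be a comma-separated "term, session" pair, and the sitting date is passed in separately. The TEI namespace is left out for brevity.

```python
import re
import xml.etree.ElementTree as ET


def split_meeting(old_xml: str, date: str) -> list[str]:
    """Turn one combined <meeting> element into term, session and sitting elements."""
    old = ET.fromstring(old_xml)
    match = re.fullmatch(r"T(\d+)S(\d+)", old.get("n", ""))
    if match is None:
        raise ValueError(f"unexpected n value: {old.get('n')!r}")
    # Drop leading zeros: "02" -> "2", "04" -> "4".
    term, session = (str(int(group)) for group in match.groups())
    organ = old.get("corresp")
    # Assumption: the text looks like "2. mandat, 04. sjednica".
    term_text, session_text = (part.strip() for part in old.text.split(",", 1))
    specs = [
        (term, f"#parla.uni #parla.term {organ}.{term}", term_text),
        (session, "#parla.uni #parla.session", session_text),
        (date, "#parla.uni #parla.sitting", date),
    ]
    result = []
    for n, ana, text in specs:
        element = ET.Element("meeting", {"n": n, "corresp": organ, "ana": ana})
        element.text = text
        result.append(ET.tostring(element, encoding="unicode"))
    return result
```

If `#parla.meeting` is chosen instead of `#parla.session`, only the second entry in `specs` needs to change.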

Session / meeting confusion

The title contains the word "session", but I think "meeting" is more appropriate (according to the parla.legislature taxonomy). If you change session to meeting, you should also change this in the <meeting> element (see above).

very few notes in the text

I checked only Bosnian, as it was the easiest language for me to understand, but the other corpora have few notes too.

I checked one steno that corresponds to this file: https://github.com/5roop/ParlaMint/blob/d1185f4274bb6bd5c94efac73ddff3d069316aa7/Data/ParlaMint-BA/ParlaMint-BA_1999-07-08-0.xml source: https://www.parlament.ba/session/SessionDetails?id=2427 pdf: https://www.parlament.ba/session/DownloadDocument?DocumentId=e4141245-622a-45e2-a3bc-cec04f48a0f8&langTag=bs

I miss notes in the text, e.g.:

<div type="debateSection">
    <note type="speaker">HALID GENJAC:</note> <!-- missing -->
    <u who="#GenjacHalid" ana="#chair" n="ParlaMint-BIH_ZD.T2.S06.u1504" xml:id="ParlaMint-BA_1999-07-08-0.u1504">
        <seg xml:id="ParlaMint-BA_1999-07-08-0.u1504.seg0">Otvaram 6. sjednicu Predstavničkog ....</seg>
    </u>


Furthermore, you can add the whole initial section, using <head> and <note type="comment">.

5roop commented 1 year ago

Thanks, @matyaskopp, will revise and resubmit PR.

nljubesi commented 1 year ago

@matyaskopp, should notes on names of speakers be present in Serbian and Croatian as well? The data sources used for Serbian and Croatian are these:

Serbian:

[Screenshot 2022-12-16 at 14 08 14]

Croatian (the speaker is actually "Reiner, Željko"):

[Screenshot 2022-12-16 at 14 16 28]

I would be against adding the speaker notes to Serbian and Croatian. Honestly, I might prefer not adding notes to Bosnian either. The speaker names are blended with the text because the parliament is unable to encode its metadata properly, but they do seem to be part of the main text.

TomazErjavec commented 1 year ago

I would be against adding the speaker notes to Serbian and Croatian.

I would tend to agree, so, unless @matyaskopp has serious reservations, I would say it is ok not to have these notes.

matyaskopp commented 1 year ago

I would be against adding the speaker notes to Serbian and Croatian.

I would tend to agree, so, unless @matyaskopp has serious reservations, I would say it is ok not to have these notes.

I am not happy about cropping these notes, but I leave it up to you. This should probably be documented in editorialDecl, but I am not sure in which section (correction or normalization?).

5roop commented 1 year ago

All three languages have the rest of the feedback implemented. BA with included speaker-type notes is in this commit.

HR has been prepared in two versions: with speaker-type notes and without them.

Same goes for RS: with speaker-type notes and without them.

Other than a decision on the correct section in which to document this, I'd also appreciate an approximate wording.

nljubesi commented 1 year ago

@matyaskopp What would the notes in Croatian and Serbian need to look like? Only the names of the speakers, or the full description available from the respective web pages?

Names @5roop already inserted, the whole description would require us to crawl everything from scratch, because we have obtained the data already crawled by an upstream project.

TomazErjavec commented 1 year ago

I think it is quite ok to have only the names of the speakers, let's not overcomplicate.

As for mentioning the fact that the full description is not included in the corpus in editorialDecl: I would not. Mostly because none of the elements there are meant for describing this, neither correction nor normalization. Also, it really is a minor detail; others, I'm sure, do more radical things without mentioning them.

5roop commented 1 year ago

Ok, so the relevant commits should be BA, HR, and RS. Those are also the latest commits on the branches.

It seems that I can't submit new PR while an old one is still open, do you need me to do anything else in order to facilitate the merge?

matyaskopp commented 1 year ago

It seems that I can't submit new PR while an old one is still open, do you need me to do anything else in order to facilitate the merge?

There is no need to open a new pull request. Once you open a pull request it is automatically synced and validated.

5roop commented 1 year ago

Ok, thanks. Do you want to inspect the sample again or can we go ahead with full-scale data preparation and submission?

matyaskopp commented 1 year ago

RS: notes are not annotated

you have removed some notes in this commit: https://github.com/5roop/ParlaMint/commit/11171ef75e4cd71d55b81c864112238a56b529ce and the rest of the notes are missing... https://github.com/5roop/ParlaMint/blob/448fb611e4132fda3af4c955b1fd6a492fc84d2c/Data/ParlaMint-RS/ParlaMint-RS_1997-12-03-0.xml#L133

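I think everything in "()" is a note. Or at least these are clear ones:

  • (Poslanici dižu ruku.)
  • (Aplauz u sali.)
  • (Sednica je prekinuta u 10 časova i 50 minuta.)
  • (Posle pauze.)

The trailing notes should be moved outside seg according to recommendations.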

5roop commented 1 year ago

I shall amend my pipeline, thanks.

The trailing notes should be moved outside seg according to recommendations.

Still inside u or outside of it?

matyaskopp commented 1 year ago

Still inside u or outside of it?

Outside, if it is at the end of the utterance. So

<u>
 <seg>text 1 <note>(n1)</note><note>(n2)</note></seg> 
 <seg>text 2 <note>(n3)</note><note>(n4)</note></seg> 
</u>

should be

<u>
 <seg>text 1</seg>
 <note>(n1)</note>
 <note>(n2)</note> 
 <seg>text 2</seg> 
</u>
<note>(n3)</note>
<note>(n4)</note>

nljubesi commented 1 year ago

  • (Poslanici dižu ruku.)
  • (Aplauz u sali.)
  • (Sednica je prekinuta u 10 časova i 50 minuta.)
  • (Posle pauze.)

The trailing notes should be moved outside seg according to recommendations.

@matyaskopp What recommendations are you referring to here? General TEI or something ParlaMint-specific? Asking because we might be 1. missing some documentation or 2. not having read the documentation properly.

Thanks!

nljubesi commented 1 year ago

I also wonder (Peter and I have discussed before the issue where to place interruptions) whether by placing notes, interruptions and similar outside of segments we lose the original "paragraph" structure that was present in the original transcripts, and that does carry some organisational information.

Keeping the organisational structure of "paragraphs" was my argument for not splitting segments because of notes / interruptions.

5roop commented 1 year ago

I also wonder (Peter and I have discussed before the issue where to place interruptions) whether by placing notes (interruptions and similar) outside of segments we lose the original "paragraph" structure that was present in the original transcripts, and that does carry some organisational information.

Keeping the organisational structure of "paragraphs" was my argument for not splitting segments because of notes / interruptions.

E.g., see https://github.com/5roop/ParlaMint/blob/40a0723eec5b45e82fdf4e63214088f7bfdf9e52/Data/ParlaMint-BA/ParlaMint-BA_1998-11-26-0.xml#L509 .

Right now I treat <gap> the same way as <note>, which means that I take it out if trailing. Specifically in this case this means we lose the information on which utterance was inaudible, which could be inferred from the text, but I'm not sure this is what we want.

Obvious solution would be don't treat gaps the same way as notes, but are we sure that a similar thing can't happen with notes too?

I guess if everyone knows that only trailing notes are treated this way, this is easier, as one only has to re-insert them to the segment from which I just extracted it, but then I don't see why the correction is even necessary.

TomazErjavec commented 1 year ago

What recommendations are you referring to here?

The ParlaMint encoding guidelines, second paragraph in Sec. on Transcriber comments.

whether by placing notes, interruptions and similar outside of segments we lose the original "paragraph" structure that was present in the original transcripts, and that does carry some organisational information.

I don't think so, or, at least, I don't think it matters if a note is immediately before a paragraph or immediately at the start of it.

Obvious solution would be don't treat gaps the same way as notes, but are we sure that a similar thing can't happen with notes too?

Yes, I agree, do not treat gaps in the same way - the guidelines do not recommend it either. But for notes, I can't think of a case where this would cause problems.

I don't see why the correction is even necessary.

Some corpora have notes only in between paragraphs or at the start or end of them. For these, we avoid mixed content in segments with this fix, which makes the linguistic annotation much easier. Also, it is nice to have all the corpora encoded in the same way.

nljubesi commented 1 year ago

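Ok, I might be starting to understand things:

  • trailing notes are pushed outside segments and, if outside the last segment, pushed out of utterances
  • trailing gaps are pushed outside segments, but kept in utterances
  • notes or gaps in segments, enclosed by text, don't break up the segments, but they stay inside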

Noting here that @TomazErjavec can intervene if we misunderstood what is to be done.

TomazErjavec commented 1 year ago

Noting here that @TomazErjavec can intervene if we misunderstood what is to be done.

I intervene only to say that this is exactly right! You put it very concisely. I guess we should add this to the guidelines, where it is at least partially implicit...

5roop commented 1 year ago

I think the latest commits should cover the issues identified above, except if @matyaskopp finds something else fishy. Thanks, everyone!

matyaskopp commented 1 year ago

Ok, I might be starting to understand things:

  • trailing notes are pushed outside segments and, if outside the last segment, pushed out of utterances
  • trailing gaps are pushed outside segments, but kept in utterances
  • notes or gaps in segments, enclosed by text, don't break up the segments, but they stay inside
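The three rules in this list could be sketched in Python roughly as follows. This is a minimal illustration, not the pipeline actually used for these corpora: the helper names (`_pop_trailing`, `_insert_after`, `push_out_trailing`) are hypothetical, the TEI namespace is left out for brevity, and every `u` is assumed to have a parent element. A note followed by a trailing gap is deliberately left inside the utterance so that the original order is preserved.

```python
import xml.etree.ElementTree as ET


def _pop_trailing(parent, tags):
    """Detach trailing children of `parent` whose tag is in `tags`, as long as
    no text follows them. Returns the detached children in document order."""
    popped = []
    while len(parent) and parent[-1].tag in tags and not (parent[-1].tail or "").strip():
        child = parent[-1]
        parent.remove(child)
        child.tail = None
        popped.insert(0, child)
    if popped:
        # Drop the whitespace that separated the text from the moved elements.
        if len(parent):
            parent[-1].tail = (parent[-1].tail or "").rstrip() or None
        else:
            parent.text = (parent.text or "").rstrip() or None
    return popped


def _insert_after(parent, anchor, elements):
    """Insert `elements` into `parent` directly after `anchor`, keeping their order."""
    position = list(parent).index(anchor) + 1
    for offset, element in enumerate(elements):
        parent.insert(position + offset, element)


def push_out_trailing(root):
    """Apply the trailing note/gap rules to every <u> below `root`."""
    parent_of = {child: parent for parent in root.iter() for child in parent}
    for u in list(root.iter("u")):
        # Trailing notes and gaps leave their segment but stay in the utterance.
        for seg in u.findall("seg"):
            _insert_after(u, seg, _pop_trailing(seg, {"note", "gap"}))
        # Notes that now end the utterance leave it as well; gaps stay inside.
        _insert_after(parent_of[u], u, _pop_trailing(u, {"note"}))
    return root
```

Notes or gaps enclosed by text are never touched, because the loop in `_pop_trailing` stops at the first child that is followed by text.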

I would like to extend @nljubesi's list with options for an annotated version

matyaskopp commented 1 year ago

I think the latest commits should cover the issues identified above, except if @matyaskopp finds something else fishy. Thanks, everyone!

I do not see anything else. Thanks for your work. Let me know once you have annotated version of the sample. I will quickly check it too.

nljubesi commented 1 year ago

Rather, thank you both, @matyaskopp and @TomazErjavec, for your really hard work.

@matyaskopp I am not sure I understand the "trailing notes are pushed outside sentences" - do you refer to the case where a note is inside the text in a segment, but ends up being at the end of a sentence? Trailing notes on the level of segments are outside segments anyway, so no way for them to be inside sentences.

This is all so (too?) complicated... :-)

5roop commented 1 year ago

I, too, am puzzled by the introduction of sentences at this stage, mostly because we do not have sentences yet. Perhaps it's best if we kindly ask @matyaskopp for another example to illustrate this point?

matyaskopp commented 1 year ago

@matyaskopp I am not sure I understand the "trailing notes are pushed outside sentences" - do you refer to the case where a note is inside the text in a segment, but ends up being at the end of a sentence? Trailing notes on the level of segments are outside segments anyway, so no way for them to be inside sentences.

wrong:

<s xml:id="s1">
  <w xml:id="s1.w1">...</w>
  <note>note 1</note>
  <w xml:id="s1.w2">...</w> 
  <note>note 2</note>
</s>

correct:

<s xml:id="s1">
  <w xml:id="s1.w1">...</w>
  <note>note 1</note>
  <w xml:id="s1.w2">...</w> 
</s>
<note>note 2</note>

5roop commented 1 year ago

Thanks, @matyaskopp , that was surreally fast :D

We'll check annotated corpora with this in mind.
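A check of this kind could be sketched in Python along the following lines. It is only an illustration: the function name `sentences_with_trailing_note` is hypothetical, and it assumes the annotated files use the TEI namespace and carry `xml:id` on every `s`.

```python
import xml.etree.ElementTree as ET

TEI_NS = "{http://www.tei-c.org/ns/1.0}"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def sentences_with_trailing_note(root):
    """Return the xml:id of every <s> whose last child element is a <note>."""
    offenders = []
    for s in root.iter(f"{TEI_NS}s"):
        if len(s) and s[-1].tag == f"{TEI_NS}note":
            offenders.append(s.get(XML_ID))
    return offenders
```

An empty result for a file means that no sentence in it ends with a note, which matches the "correct" pattern shown above.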

5roop commented 1 year ago

We investigated the annotated sample and we think it's ok regarding the trailing notes in sentences. We had to reduce the sample a bit to stay under 100MB.

@matyaskopp, I think the tickbox under Session / meeting confusion should be marked as done, because the titles were replaced, or did I overlook something?

matyaskopp commented 1 year ago

@matyaskopp, I think the tickbox under Session / meeting confusion should be marked as done, because the titles were replaced, or did I overlook something?

OK, if you have decided to use session and this is correct: https://github.com/5roop/ParlaMint/blob/87de0bdd927c98333c6cf3e36b10cf32816df3d7/Data/ParlaMint-BA/ParlaMint-BA_1998-11-26-0.xml#L11

<meeting n="01" corresp="#PS" ana="#parla.uni #parla.session">Sjednica 01</meeting>

then I can mark it as done. The definitions are:

https://github.com/clarin-eric/ParlaMint/blob/89fa819303d66d916ef97b83c368150c6d0ef5b6/Data/Taxonomies/ParlaMint-taxonomy-parla.legislature.xml#L156
https://github.com/clarin-eric/ParlaMint/blob/89fa819303d66d916ef97b83c368150c6d0ef5b6/Data/Taxonomies/ParlaMint-taxonomy-parla.legislature.xml#L162

So if you agree with this, I will tick the last tickbox

5roop commented 1 year ago

Oh I see, I did overlook something, I managed to use an old template at some point it seems.

The correct reference should be meeting, since in all three corpora there is usually more than a single order of business discussed, so I'll change sitting -> meeting.

matyaskopp commented 1 year ago

Great. So now please add a TEI.ana sample and I will check it quickly.

5roop commented 1 year ago

For now I only present the HR corpus, the other root ana files are still under construction. @matyaskopp, is HR enough for you to give us a go ahead for all corpora?

(Just got the email that the latest commit fails validation, but this is because the sample size is too big. Locally it validates. I'll prune it again and commit again presently.)

matyaskopp commented 1 year ago

For now I only present the HR corpus, the other root ana files are still under construction. @matyaskopp, is HR enough for you to give us a go ahead for all corpora?

Yes, HR TEI.ana sample is ok. So you can continue with RS and BA.

5roop commented 1 year ago

@matyaskopp, @TomazErjavec, I found weird behaviour when validating annotated RS files.

make validate-parlamint-RS outputs numerous errors like this: /home/rupnik/ParlaMint/Data/ParlaMint-RS/ParlaMint-RS_1997-12-03-0.xml:117:2600: error: text not allowed here; expected element "gap", "incident", "kinesic", "note", "pb", "s" or "vocal" But line 117 only has 112 characters.

Another error is for instance /home/rupnik/ParlaMint/Data/ParlaMint-RS/ParlaMint-RS_1997-12-03-0.xml:326:7: error: text not allowed here; expected the element end-tag or element "gap", "incident", "kinesic", "note", "pb", "s" or "vocal", but here the 7th character is indentation before the actual xml content.

Is this a validation bug or are our ana data corrupted? I checked them visually, and I don't see what in RS corpus is so different that validation would raise errors, while no component in BA and HR corpora raised them.

I will be linking this issue in the latest commit if you'd care to inspect the data.

TomazErjavec commented 1 year ago

This is because in your root .ana file (i.e. ParlaMint-RS.ana.xml ) you XInclude the non-annotated component files, so the linguistically un-annotated files get validated with the schema for annotated files (which expects sentences, not text inside segments).

So, instead of <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="ParlaMint-RS_1997-12-03-0.xml"/> you should have there <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="ParlaMint-RS_1997-12-03-0.ana.xml"/> etc.
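A quick check for this mistake could be sketched in Python as follows. It is only an illustration: the function name `non_ana_components` is hypothetical, and it assumes that component files can be recognised by an underscore followed by a date in the file name (as in `ParlaMint-RS_1997-12-03-0.xml`), so that other XIncluded files are ignored.

```python
import re
import xml.etree.ElementTree as ET

XI_INCLUDE = "{http://www.w3.org/2001/XInclude}include"
# Assumption: component file names contain "_YYYY-MM-DD".
COMPONENT = re.compile(r"_\d{4}-\d{2}-\d{2}")


def non_ana_components(root_xml: str) -> list[str]:
    """List XIncluded component files in a root .ana file that are not .ana.xml."""
    root = ET.fromstring(root_xml)
    hrefs = [element.get("href", "") for element in root.iter(XI_INCLUDE)]
    return [h for h in hrefs if COMPONENT.search(h) and not h.endswith(".ana.xml")]
```

Run on the content of the root .ana file before validation, a non-empty result would point directly at the includes that need the `.ana.xml` suffix.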

5roop commented 1 year ago

Wow, thanks for the quick response, this is a stupid mistake for a stupidly simple task that I really should have automated sooner....

TomazErjavec commented 1 year ago

Can we close this?

5roop commented 1 year ago

@TomazErjavec, I sent an email with the path to corrected RS corpus last week, specifically on 26. 01. 2023, 09:50. If you saw it and propagated the latest version downstream, I see no reason to keep it open.

TomazErjavec commented 1 year ago

As the corpora are meant to now be char cleaned, let's keep this open till the sample from the really final 3.0 corpus is ready.

TomazErjavec commented 1 year ago

Sample ok now, so closing.