Closed matyaskopp closed 1 year ago
Thanks, @matyaskopp, will revise and resubmit PR.
@matyaskopp, should notes on names of speakers be present in Serbian and Croatian as well? The data sources used for Serbian and Croatian are these:
Serbian:
Croatian (the speaker is actually "Reiner, Željko"):
I would be against adding the speaker notes to Serbian and Croatian. Honestly, I might prefer not adding notes to Bosnian either: the speaker names are blended with the text because the parliament is unable to encode its metadata properly, but they do seem to be part of the main text.
I would be against adding the speaker notes to Serbian and Croatian.
I would tend to agree, so, unless @matyaskopp has serious reservations, I would say it is ok not to have these notes.
I am not happy about cropping these notes, but I leave it up to you. This should probably be documented in `editorialDecl`, but I am not sure in which section (`correction` or `normalization`?).
All three languages have the rest of the feedback implemented. BA with included speaker-type notes is in this commit.
HR has been prepared in two versions: with speaker-type notes and without them.
Same goes for RS: with speaker-type notes and without them.
Other than a decision on the correct section in which to document this, I'd also appreciate an approximate wording.
@matyaskopp What would the notes in Croatian and Serbian need to look like? Only the names of the speakers, or the full description available from the respective web pages?
@5roop has already inserted the names; the whole description would require us to crawl everything from scratch, because we obtained the data already crawled by an upstream project.
I think it is quite ok to have only the names of the speakers, let's not overcomplicate.
As for mentioning the fact that the full description is not included in the corpus in editorialDecl: I would not. Mostly because none of the elements there are meant for describing this, neither correction nor normalization. Also, it really is a minor detail; others, I'm sure, do more radical things without mentioning them.
It seems that I can't submit new PR while an old one is still open, do you need me to do anything else in order to facilitate the merge?
There is no need to open a new pull request. Once you open a pull request it is automatically synced and validated.
Ok, thanks. Do you want to inspect the sample again or can we go ahead with full-scale data preparation and submission?
you have removed some notes in this commit: https://github.com/5roop/ParlaMint/commit/11171ef75e4cd71d55b81c864112238a56b529ce and the rest of the notes are missing... https://github.com/5roop/ParlaMint/blob/448fb611e4132fda3af4c955b1fd6a492fc84d2c/Data/ParlaMint-RS/ParlaMint-RS_1997-12-03-0.xml#L133
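I think everything in "()" is a note. Or at least these are clear ones:
- (Poslanici dižu ruku.)
- (Aplauz u sali.)
- (Sednica je prekinuta u 10 časova i 50 minuta.)
- (Posle pauze.)
The trailing notes should be moved outside `seg` according to the recommendations.
I shall amend my pipeline, thanks.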
The trailing notes should be moved outside `seg` according to the recommendations.
Still inside `u` or outside of it?
Outside, if it is at the end of the utterance. So
<u>
<seg>text 1 <note>(n1)</note><note>(n2)</note></seg>
<seg>text 2 <note>(n3)</note><note>(n4)</note></seg>
</u>
should be
<u>
<seg>text 1</seg>
<note>(n1)</note>
<note>(n2)</note>
<seg>text 2</seg>
</u>
<note>(n3)</note>
<note>(n4)</note>
- (Poslanici dižu ruku.)
- (Aplauz u sali.)
- (Sednica je prekinuta u 10 časova i 50 minuta.)
- (Posle pauze.)
The trailing notes should be moved outside `seg` according to the recommendations.
@matyaskopp What recommendations are you referring to here? General TEI or something ParlaMint-specific? Asking because we might be 1. missing some documentation or 2. not having read the documentation properly.
Thanks!
I also wonder (Peter and I have discussed before the issue where to place interruptions) whether by placing notes, interruptions and similar outside of segments we lose the original "paragraph" structure that was present in the original transcripts, and that does carry some organisational information.
Keeping the organisational structure of "paragraphs" was my argument for not splitting segments because of notes / interruptions.
Right now I treat
The obvious solution would be not to treat gaps the same way as notes, but are we sure that a similar thing can't happen with notes too?
I guess if everyone knows that only trailing notes are treated this way, this is easier, as one only has to re-insert them into the segment from which they were just extracted, but then I don't see why the correction is even necessary.
What recommendations are you referring to here?
The ParlaMint encoding guidelines, second paragraph in Sec. on Transcriber comments.
whether by placing notes, interruptions and similar outside of segments we lose the original "paragraph" structure that was present in the original transcripts, and that does carry some organisational information.
I don't think so, or, at least, I don't think it matters whether a note is immediately before a paragraph or immediately at the start of it.
The obvious solution would be not to treat gaps the same way as notes, but are we sure that a similar thing can't happen with notes too?
Yes, I agree, do not treat gaps in the same way - the guidelines do not recommend it either. But for notes, I can't think of a case where this would cause problems.
I don't see why the correction is even necessary.
Some corpora have notes only in between paragraphs or at the start or end of paragraphs. For these, we avoid mixed content in segments with this fix, which makes the linguistic annotation much easier. Also, it is nice to have all the corpora encoded in the same way.
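OK, I think I am starting to understand things:
- trailing notes are pushed outside segments and, if outside the last segment, pushed out of utterances
- trailing gaps are pushed outside segments, but kept in utterances
- notes or gaps in segments, enclosed by text, don't break up the segments, but they stay inside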
Noting here that @TomazErjavec can intervene if we misunderstood what is to be done.
I intervene only to say that this is exactly right! You put it very concisely. I guess we should add this to the guidelines, where it is at least partially implicit...
I think the latest commits should cover the issues identified above, except if @matyaskopp finds something else fishy. Thanks, everyone!
OK, I think I am starting to understand things:
- trailing notes are pushed outside segments and, if outside the last segment, pushed out of utterances
- trailing gaps are pushed outside segments, but kept in utterances
- notes or gaps in segments, enclosed by text, don't break up the segments, but they stay inside
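The three rules above can be sketched in code. This is a minimal illustration, not the pipeline actually used for these corpora: the function names `pop_trailing` and `hoist_trailing` are hypothetical, elements are assumed to be un-namespaced as in the examples in this thread (real ParlaMint files use the TEI namespace), and utterances are assumed to be direct children of the element passed in.

```python
import xml.etree.ElementTree as ET

MOVABLE = {"note", "gap"}  # elements that may be hoisted out of a <seg>


def pop_trailing(seg):
    """Detach and return the notes/gaps at the very end of a <seg>.

    An element counts as trailing only if no text follows it, so notes
    and gaps enclosed by text stay where they are.
    """
    trailing = []
    while len(seg) and seg[-1].tag in MOVABLE and not (seg[-1].tail or "").strip():
        child = seg[-1]
        seg.remove(child)
        child.tail = None
        trailing.insert(0, child)  # keep document order
    if trailing:
        # Drop the whitespace that used to separate the text from the notes.
        if len(seg):
            seg[-1].tail = (seg[-1].tail or "").rstrip()
        else:
            seg.text = (seg.text or "").rstrip()
    return trailing


def hoist_trailing(parent):
    """Apply the three rules to every <u> that is a direct child of parent."""
    new_children = []
    for child in list(parent):
        if child.tag != "u":
            new_children.append(child)
            continue
        inside_u = []
        for item in list(child):
            inside_u.append(item)
            if item.tag == "seg":
                inside_u.extend(pop_trailing(item))
        # Notes at the end of the utterance leave it; gaps stay inside.
        after_u = []
        while inside_u and inside_u[-1].tag == "note":
            after_u.insert(0, inside_u.pop())
        child[:] = inside_u
        new_children.append(child)
        new_children.extend(after_u)
    parent[:] = new_children


if __name__ == "__main__":
    div = ET.fromstring(
        "<div><u><seg>text 1 <note>(n1)</note><note>(n2)</note></seg>"
        "<seg>text 2 <note>(n3)</note><note>(n4)</note></seg></u></div>"
    )
    hoist_trailing(div)
    print(ET.tostring(div, encoding="unicode"))
```

The sketch does not deal with segments that consist of nothing but notes, or with pretty-printing whitespace between elements; both would need a decision in a real pipeline.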
I would like to extend @nljubesi's list with options for an annotated version
I think the latest commits should cover the issues identified above, except if @matyaskopp finds something else fishy. Thanks, everyone!
I do not see anything else. Thanks for your work. Let me know once you have an annotated version of the sample. I will quickly check it too.
Rather, thank you both, @matyaskopp and @TomazErjavec, for your really hard work.
@matyaskopp I am not sure I understand the "trailing notes are pushed outside sentences" - do you refer to the case where a note is inside the text in a segment, but ends up being at the end of a sentence? Trailing notes on the level of segments are outside segments anyway, so no way for them to be inside sentences.
This is all so (too?) complicated... :-)
I, too, am puzzled by the introduction of sentences at this stage, mostly because we do not have sentences yet. Perhaps it's best if we kindly ask @matyaskopp for another example to illustrate this point?
@matyaskopp I am not sure I understand the "trailing notes are pushed outside sentences" - do you refer to the case where a note is inside the text in a segment, but ends up being at the end of a sentence? Trailing notes on the level of segments are outside segments anyway, so no way for them to be inside sentences.
wrong:
<s xml:id="s1">
<w xml:id="s1.w1">...</w>
<note>note 1</note>
<w xml:id="s1.w2">...</w>
<note>note 2</note>
</s>
correct:
<s xml:id="s1">
<w xml:id="s1.w1">...</w>
<note>note 1</note>
<w xml:id="s1.w2">...</w>
</s>
<note>note 2</note>
Thanks, @matyaskopp , that was surreally fast :D
We'll check annotated corpora with this in mind.
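A check of this kind could look roughly like the sketch below. It only illustrates the wrong/correct pair above; the helper names `local_name` and `sentences_ending_in_note` are hypothetical, and the only assumption about the data is that a sentence is an `s` element whose children are elements such as `w` and `note`.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def local_name(tag):
    """Strip a '{namespace}' prefix, so TEI-namespaced files also work."""
    return tag.rsplit("}", 1)[-1]


def sentences_ending_in_note(root):
    """Return the xml:id of every <s> whose last child is a <note>."""
    offenders = []
    for el in root.iter():
        if local_name(el.tag) == "s" and len(el) and local_name(el[-1].tag) == "note":
            offenders.append(el.get(XML_ID))
    return offenders


if __name__ == "__main__":
    wrong = ET.fromstring(
        '<seg><s xml:id="s1"><w xml:id="s1.w1">...</w><note>note 1</note>'
        '<w xml:id="s1.w2">...</w><note>note 2</note></s></seg>'
    )
    print(sentences_ending_in_note(wrong))
```

A note in the middle of a sentence is left alone; only a note in final position is reported.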
We investigated the annotated sample and we think it's ok regarding the trailing notes in sentences. We had to reduce the sample a bit to stay under 100MB.
@matyaskopp, I think the tickbox under "Session / meeting confusion" should be marked as done, because the titles were replaced, or did I overlook something?
OK. If you have decided to use session, and this is correct:
https://github.com/5roop/ParlaMint/blob/87de0bdd927c98333c6cf3e36b10cf32816df3d7/Data/ParlaMint-BA/ParlaMint-BA_1998-11-26-0.xml#L11
<meeting n="01" corresp="#PS" ana="#parla.uni #parla.session">Sjednica 01</meeting>
then I can mark it as done. The definitions are:
https://github.com/clarin-eric/ParlaMint/blob/89fa819303d66d916ef97b83c368150c6d0ef5b6/Data/Taxonomies/ParlaMint-taxonomy-parla.legislature.xml#L156
https://github.com/clarin-eric/ParlaMint/blob/89fa819303d66d916ef97b83c368150c6d0ef5b6/Data/Taxonomies/ParlaMint-taxonomy-parla.legislature.xml#L162
So if you agree with this, I will tick the last tickbox.
Oh I see, I did overlook something, I managed to use an old template at some point it seems.
The correct reference should be meeting, since in all three corpora there is usually more than a single order of business discussed, so I'll change sitting -> meeting.
Great. So now please add a TEI.ana sample and I will check it quickly.
For now I only present the HR corpus, the other root ana files are still under construction. @matyaskopp, is HR enough for you to give us a go ahead for all corpora?
(Just got the email that the latest commit fails validation, but this is because the sample size is too big. Locally it validates. I'll prune it again and commit again presently.)
For now I only present the HR corpus, the other root ana files are still under construction. @matyaskopp, is HR enough for you to give us a go ahead for all corpora?
Yes, HR TEI.ana sample is ok. So you can continue with RS and BA.
@matyaskopp, @TomazErjavec, I found weird behaviour when validating annotated RS files.
make validate-parlamint-RS
outputs numerous errors like this:
/home/rupnik/ParlaMint/Data/ParlaMint-RS/ParlaMint-RS_1997-12-03-0.xml:117:2600: error: text not allowed here; expected element "gap", "incident", "kinesic", "note", "pb", "s" or "vocal"
But line 117 only has 112 characters.
Another error is, for instance:
/home/rupnik/ParlaMint/Data/ParlaMint-RS/ParlaMint-RS_1997-12-03-0.xml:326:7: error: text not allowed here; expected the element end-tag or element "gap", "incident", "kinesic", "note", "pb", "s" or "vocal"
but here the 7th character is indentation before the actual XML content.
Is this a validation bug or are our ana data corrupted? I checked them visually, and I don't see what in RS corpus is so different that validation would raise errors, while no component in BA and HR corpora raised them.
I will be linking this issue in the latest commit if you'd care to inspect the data.
This is because in your root .ana file (i.e. ParlaMint-RS.ana.xml ) you XInclude the non-annotated component files, so the linguistically un-annotated files get validated with the schema for annotated files (which expects sentences, not text inside segments).
So, instead of
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="ParlaMint-RS_1997-12-03-0.xml"/>
you should have there
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="ParlaMint-RS_1997-12-03-0.ana.xml"/>
etc.
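A quick check for this mistake could look like the sketch below. The function name `plain_component_includes` is hypothetical, and the sketch assumes that the component files are the `xi:include` elements sitting directly under the root element, so that includes nested deeper (for instance taxonomies in the header) are ignored.

```python
import xml.etree.ElementTree as ET

XI_INCLUDE = "{http://www.w3.org/2001/XInclude}include"


def plain_component_includes(root_xml):
    """List hrefs of top-level XIncludes that do not point at .ana.xml files.

    Assumption: component files are included as direct children of the
    root element of the root .ana file.
    """
    root = ET.fromstring(root_xml)
    return [
        el.get("href", "")
        for el in root
        if el.tag == XI_INCLUDE and not el.get("href", "").endswith(".ana.xml")
    ]


if __name__ == "__main__":
    sample = (
        '<teiCorpus xmlns="http://www.tei-c.org/ns/1.0" '
        'xmlns:xi="http://www.w3.org/2001/XInclude">'
        '<xi:include href="ParlaMint-RS_1997-12-03-0.xml"/>'
        "</teiCorpus>"
    )
    print(plain_component_includes(sample))
```

An empty result means every top-level include points at an annotated component, which is what the schema for annotated files expects.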
Wow, thanks for the quick response, this is a stupid mistake for a stupidly simple task that I really should have automated sooner....
Can we close this?
@TomazErjavec, I sent an email with the path to corrected RS corpus last week, specifically on 26. 01. 2023, 09:50. If you saw it and propagated the latest version downstream, I see no reason to keep it open.
As the corpora are meant to now be char cleaned, let's keep this open till the sample from the really final 3.0 corpus is ready.
Sample ok now, so closing.
term in `meeting`
`meeting` should correspond to the proceeding timespan
https://github.com/5roop/ParlaMint/blob/0a4f83ce4bb79ee297aa0c74f687cd0a95a68025/Data/ParlaMint-HR/ParlaMint-HR.xml#L15
this should be removed
There are no proceedings from the 10th term. It starts after the last date in corpus `2022-07-15`
component file contains sitting
sitting in component file
https://github.com/5roop/ParlaMint/blob/d1185f4274bb6bd5c94efac73ddff3d069316aa7/Data/ParlaMint-BA/ParlaMint-BA_1999-02-10-0.xml#L2
should be
meeting element in the component file
#parla.uni
The `<meeting>` element in the component files should be structured.
https://github.com/5roop/ParlaMint/blob/d1185f4274bb6bd5c94efac73ddff3d069316aa7/Data/ParlaMint-BA/ParlaMint-BA_1999-02-10-0.xml#L10
should be
Session / meeting confusion
The title contains the word "session", but I think "meeting" is more proper (according to the parla.legislature taxonomy). If you change session to meeting, you should also change this in the `<meeting>` element (see above).
very few notes in the text
I checked only Bosnian, as it was the easiest language for me to understand. But other corpora have a few notes too....
I checked one steno that corresponds to this file:
https://github.com/5roop/ParlaMint/blob/d1185f4274bb6bd5c94efac73ddff3d069316aa7/Data/ParlaMint-BA/ParlaMint-BA_1999-07-08-0.xml
source: https://www.parlament.ba/session/SessionDetails?id=2427
pdf: https://www.parlament.ba/session/DownloadDocument?DocumentId=e4141245-622a-45e2-a3bc-cec04f48a0f8&langTag=bs
I miss notes in the text, e.g.:
Furthermore, you can add the whole initial section and use `<head>` and `<note type="comment">`.