Closed matyaskopp closed 1 year ago
Weird event label
There's a missing f string in Python. Is
<event from="2014-09-29" to="2018-09-24">
<label>Riksdagen 2014 - 2018</label>
</event>
correct?
debates beginning
This is how the original data is laid out, there is no clear distinction between debates and the metatext before that.
missing opposition relation
I tried to find a clear way to denote opposition and confidence and supply, but I didn't find one so I left it out. Is there one?
The rest of the points look like bugs that are rather straightforward and quick to fix.
One or two politicians?
<person xml:id="Q59387749">
<persName>
<surname>Andersson</surname>
<forename>Jonas</forename>
</persName>
<sex value="M"/>
<affiliation role="member" ref="#Riksdagen" from="2018-09-24"/>
<affiliation role="member" ref="#Q504069"/>
</person>
<person xml:id="Q58837098">
<persName>
<surname>Andersson</surname>
<forename>Jonas</forename>
</persName>
<sex value="M"/>
<affiliation role="member" ref="#Riksdagen" from="2018-09-24"/>
<affiliation role="member" ref="#Q504069"/>
</person>
Nah, Jonas Andersson might just be one of the most common names in Sweden.
Status 2022-11-15
missing opposition relation
I tried to find a clear way to denote opposition and confidence and supply, but I didn't find one so I left it out. Is there one?
The determination of whether a party is in opposition is usually based on how the party sees itself or how the public sees it. In CZ, it is common for parties to declare that they are in opposition: they don't agree with the government and don't want to take responsibility for the government's actions. There is no contract saying someone is in opposition, so it is a bit fuzzy. To conclude: it is up to you how you see it...
I agree with @matyaskopp, except for countries with a majority government like Slovenia, where you are either in the coalition, and thus form the government, or you are in opposition. The only exceptions here are the independent MPs.
Okay. So we should
Not sure I quite understand, or, rather, who is then marked with "coalition"? Also, you don't really "mark opposition parties with the 'opposition' tag"; rather, you introduce a relation grouping them as opposition, cf. 5.2.4. Relations between organisations.
debates beginning
This is how the original data is laid out, there is no clear distinction between debates and the metatext before that.
I do not understand Swedish, but with Google Translate it seems to me that announcements start with:
Talmannen (meddelade|anmälde)
But it looks more like a pronouncement that someone said something, so I guess it can be encoded as a note.
The regular/chair speeches are highlighted and numbered:
Anf. {number} {name or TALMANNEN} ({party for regular speaker}):
and the heading is in an h2 element in this format: https://www.riksdagen.se/sv/dokument-lagar/dokument/protokoll/protokoll-202122141-mandagen-den-5-september_H909141/html#_Toc115951428
Sometimes there are interpellations (https://www.riksdagen.se/sv/dokument-lagar/dokument/protokoll/protokoll-202122140-mandagen-den-5-september_H909140/html#_Toc115950042) that are in written form (I guess) and are stored in another place.
The question is how to determine the end of a speech: sometimes there is applause, followed by a note that is not highlighted:
Vi från Miljöpartiet är tydliga: Vi måste stötta svenska företag och hushåll och civilsamhället genom detta. Vi måste stötta ekonomiskt, speciellt de allra svagaste. Vi måste få en ändring av prissättningen på elen. Vi måste bygga ut det förnybara, och i allt detta måste vi också energieffektivisera. Det är så vi bygger Sverige starkare tillsammans både på kort och på lång sikt.
(Applåder)
Överläggningen var härmed avslutad.
(Beslut fattades under § 8.)
I am not sure if I am missing something, but it seems that your protocol contains more notes than speeches. So notes should be encoded as notes, not turned into speeches.
incidents encoding documentation: https://clarin-eric.github.io/ParlaMint/#sec-incidents
<kinesic type="applause">
<desc>(Applåder)</desc>
</kinesic>
@matyaskopp The paragraphs are annotated into utterances, segments etc. automatically using BERT, which is why there are some occasional misclassified "utterances" mixed in with the metatext at the beginning of protocols. We can use some heuristics to improve the quality of that classification if that is necessary (the protocols 2015-2022 seem to be more consistent than the whole 1920-2022 period we're working with).
However, you also imply the protocols should not start with a bunch of notes. Our plan has been not to exclude any data from the original protocols, but rather to annotate so that e.g. only utterances can be extracted afterwards in any downstream task. Can we go on doing this?
@TomazErjavec I mean, is it necessary to label the supporting parties (in the way the schema proposes, technicalities are not relevant to the question) at all, if we do label the governments and the opposition blocks? AFAIK, the definition of supporting parties can be a bit blurry in the Swedish parliament.
I did not know that the utterance annotation is done automatically with BERT, and I do not want to remove any data - beginning notes are ok to preserve, but there are flagrant mixtures of notes and misclassified utterances.
If the protocols from the ParlaMint period are more consistent, they can probably be segmented into utterances by rules: An utterance starts with something like this:
<h2>Anf. 1 TALMANNEN:</h2>
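A minimal sketch of what such rule-based segmentation could look like, assuming the heading text has already been pulled out of the h2 elements and follows the "Anf. {number} {name or TALMANNEN} ({party}):" shape described earlier in this thread. The names HEADING, parse_heading and segment are hypothetical, and the regular expression is a guess at the format, not a tested description of all protocols.

```python
import re

# Assumed heading shape (from the thread): Anf. {number} {name or TALMANNEN} ({party}):
# The party in parentheses and the trailing colon are both treated as optional.
HEADING = re.compile(
    r"^Anf\.\s+(?P<number>\d+)\s+(?P<speaker>.+?)"
    r"(?:\s+\((?P<party>[^()]+)\))?\s*:?\s*$"
)


def parse_heading(text):
    """Return number, speaker and party for a speaker heading, or None."""
    match = HEADING.match(text.strip())
    if match is None:
        return None
    return {
        "number": int(match.group("number")),
        "speaker": match.group("speaker"),
        "party": match.group("party"),
    }


def segment(paragraphs):
    """Group paragraphs under the most recent speaker heading.

    Paragraphs before the first heading are returned separately as notes.
    """
    notes, speeches = [], []
    for text in paragraphs:
        heading = parse_heading(text)
        if heading is not None:
            speeches.append({**heading, "paragraphs": []})
        elif speeches:
            speeches[-1]["paragraphs"].append(text)
        else:
            notes.append(text)
    return notes, speeches
```

This only finds where a speech starts. It does not solve the end-of-speech question raised above: trailing lines such as "(Applåder)" or "Överläggningen var härmed avslutad." would still be attached to the preceding speech and need a separate rule.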
@matyaskopp I think we are missing the forest for the trees here. Are you available for a quick Zoom call tomorrow or on Friday?
@ninpnin, ok, Friday is better (anytime before noon). Tomorrow we have a state holiday and the children are at home... Please send me an email with a link and a time that suits you.
Could you then also pls. discuss https://github.com/clarin-eric/ParlaMint/issues/436#issuecomment-1316504335, I can't answer that simply.
Status 2022-11-18
@matyaskopp I've made my changes, I think you can check the files again now.
wrong date in corpus root setting
- [x] fix root setting/date element value
- [x] fix/remove setting/date/@ana
setting/date should contain the timespan of the corpus (from-to), and if you want to add an @ana attribute, it should contain a list of terms https://github.com/ninpnin/ParlaMint/blob/6b36bd4952aecde322bbd4c52ad4eed7a73e6ec3/Data/ParlaMint-SE/ParlaMint-SE.xml#L316
<setting>
<name type="org">Sveriges riksdag</name>
<name type="address">Riksgatan 1</name>
<name type="city">Stockholm</name>
<name type="country">Sweden</name>
<date when="2016-09-15" ana="#parla.sitting">2016-09-15</date>
</setting>
from and to
https://github.com/ninpnin/ParlaMint/blob/18c7a5124a0a5c925a387031213888db583617a6/Data/ParlaMint-SE/ParlaMint-SE.xml#L301
<date from="2015-01-01" to="2022-07-01">2015-01-01 - 2022-10-01</date>
missing Swedish translations in taxonomies
- [ ] taxonomies translations
cs content marked as sv
https://github.com/ninpnin/ParlaMint/blob/18c7a5124a0a5c925a387031213888db583617a6/Data/ParlaMint-SE/ParlaMint-SE.xml#L251
<taxonomy xml:id="meeting.parts">
<desc xml:lang="sv">
<term>Bod</term>
</desc>
<desc xml:lang="en">
<term>Agenda</term>
</desc>
<category xml:id="parla.agenda">
<catDesc xml:lang="sv">
<term>Bod jednání</term>
</catDesc>
<catDesc xml:lang="en"><term>Agenda</term>: topic discussed during sitting</catDesc>
</category>
</taxonomy>
<org xml:id="Riksdagen" role="parliament" ana="#parla.uni #parla.national">
<orgName full="yes" xml:lang="sv">Sveriges riksdag</orgName>
<orgName full="abb" xml:lang="sv">Riksdagen</orgName>
<listEvent>
<event from="2014-09-29" to="2018-09-24">
<label>Riksdagen 2014 - 2018</label>
</event>
<event from="2018-09-24" to="2022-09-27">
<label>Riksdagen 2018 - 2022</label>
</event>
<event from="2022-09-27">
<label>Riksdagen 2022 - 2026</label>
</event>
</listEvent>
</org>
Your current term started on 2022-09-27. If you don't have any text content from this term, the meeting should be removed https://github.com/ninpnin/ParlaMint/blob/18c7a5124a0a5c925a387031213888db583617a6/Data/ParlaMint-SE/ParlaMint-SE.xml#L10
<meeting n="2022-2026" ana="#parla.uni #parla.term">Mandatperioden 2022–2026</meeting>
BTW, are you sure that there will not be an early election in Sweden? You are setting a date in the future in the text.
@join="right"
https://clarin-eric.github.io/ParlaMint/#sec-ana-words
<s xml:id="i-PDtgGeMQC837eq5Uk8pet4">
<w msd="UPosTag=NOUN|Case=Nom|Definite=Ind|Gender=Com|Number=Sing" lemma="fru" xml:id="i-PDN9z16TfCMx8fbyzdAR3J">Fru</w>
<!-- next token should contain attribute join: -->
<w msd="UPosTag=NOUN|Case=Nom|Definite=Ind|Gender=Com|Number=Sing" lemma="talman" xml:id="i-PDNASeziU3EPzn6PQjv8bv">talman</w>
<pc msd="UPosTag=PUNCT" xml:id="i-PDNAaz6AqvkfL4d1j9n3Tz">!</pc>
<linkGrp targFunc="head argument" type="UD-SYN">
<link ana="ud-syn:det" target="#i-PDNASeziU3EPzn6PQjv8bv #i-PDN9z16TfCMx8fbyzdAR3J"/>
<link ana="ud-syn:punct" target="#i-PDNASeziU3EPzn6PQjv8bv #i-PDNAaz6AqvkfL4d1j9n3Tz"/>
<link ana="ud-syn:root" target="#i-PDtgGeMQC837eq5Uk8pet4 #i-PDNASeziU3EPzn6PQjv8bv"/>
</linkGrp>
</s>
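A small sketch of how the join flag could be derived, assuming (as I read the linked guidelines section) that join="right" marks a token that is not followed by whitespace in the original text. The function name join_flags is hypothetical, and it assumes the tokens appear in the text in order and unmodified.

```python
def join_flags(text, tokens):
    """For each token, report whether it is glued to the token that follows.

    A True value means the token would carry join="right" under the
    assumption stated above.
    """
    flags = []
    position = 0
    for token in tokens:
        start = text.index(token, position)  # raises ValueError if not found
        position = start + len(token)
        followed_by_text = position < len(text) and not text[position].isspace()
        flags.append(followed_by_text)
    return flags


# "talman" is directly followed by "!", so it is the token that needs the attribute.
print(join_flags("Fru talman!", ["Fru", "talman", "!"]))
```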
name is missing type
name/@type
https://clarin-eric.github.io/ParlaMint/#sec-ner
<name>
<w msd="UPosTag=PROPN|Case=Nom" lemma="Mats" xml:id="i-PDNAiUsgPE86jDhNp83xqr">Mats</w>
</name>
<name>
<w msd="UPosTag=PROPN|Case=Nom" lemma="Green" xml:id="i-PDNApK3JFMBtG7sDSD9uFJ">Green</w>
</name>
<prefixDef ident="ne" matchPattern="(.+)" replacementPattern="#NER.cnec2.0.$1">
<p>Taxonomy for named entities (cnec2.0)</p>
</prefixDef>
I think the ToC should be removed (it is not a debateSection) https://github.com/ninpnin/ParlaMint/blob/18c7a5124a0a5c925a387031213888db583617a6/Data/ParlaMint-SE/ParlaMint-SE_2015-10-23-prot-201516--19.xml#L659
<div type="debateSection">
<head xml:id="i-49WGkDYanfrBThGEhk84fS">§ 1 Anmälan om fördröjda svar på interpellationer</head>
<note xml:id="i-JAysHk636TLJv63cvK4rPw">§ 2 Ärenden för hänvisning till utskott</note>
<note xml:id="i-4TnEn6p6xvQ8jVnVStMZ4v">§ 3 Svar på interpellation 2015/16:49 om stöd till kommuner vid mottagande av ensamkommande flyktingbarn</note>
<note xml:id="i-XxebtctPXck2uwgTMJJVEB" type="speaker">Anf. 1 Justitie- och migrationsminister MORGAN JOHANSSON (S)</note>
<note xml:id="i-HhiKLKZn5mipTeviSARiJw" type="speaker">Anf. 2 MATS GREEN (M)</note>
<note xml:id="i-DeeVZkTEi4U1kWpVjA2XCo" type="speaker">Anf. 3 Justitie- och migrationsminister MORGAN JOHANSSON (S)</note>
<note xml:id="i-8Mh4hbZzi3Ak2AA2ohxtSc" type="speaker">Anf. 4 MATS GREEN (M)</note>
<note xml:id="i-5dnmYPNMGXt65udU3NKhKb" type="speaker">Anf. 5 Justitie- och migrationsminister MORGAN JOHANSSON (S)</note>
@TomazErjavec, I like the structuring of the document ( https://github.com/ninpnin/ParlaMint/blob/18c7a5124a0a5c925a387031213888db583617a6/Data/ParlaMint-SE/ParlaMint-SE_2015-10-23-prot-201516--19.xml)
Adding div and head made it well arranged, but there are div[@type="debateSection"] which are not really debates. I tend to remove type="debateSection" and preserve the structure. Do you agree?
Hm, I don't like having a new typeless type of div.
I would say these are either stand-alone notes at the start of the body (before the first div), which would be the principle of minimal effort. The "proper" way of doing it would be to introduce <front>, as this is obviously front-matter, and front-matter should not be linguistically annotated. But this means changing the schema, thinking about exactly what front can contain, and maybe changing the corpora of other partners - do we want to do all this now?
Third option: remove the ToC.
@matyaskopp @TomazErjavec the schema does not enumerate the values type can take, does it? Let's make it div type="tableOfContents" ?
I don't want to remove data. You're never going to notice if you've accidentally removed debate sections.
I don't want to remove data.
OK. Let's continue this in #472.
Status 2022-11-23
@matyaskopp All the problems you reported (except for the missing translations) should be fixed now. I also decided to just bite the bullet and remove the TOCs.
I assume the debate section thing can be changed once you decide what to do with it, it should be easy enough from our side.
From my side, it would be good to know if the corpus is now at an acceptable standard. If not, I'd like to have all remaining critical problems listed here at once. I have limited time resources to go back and forth with this.
Thanks for the changes in your corpus. Your corpus is now significantly better. I hope this is the final list of problems that my tired eyes can spot in the sample. Another one may appear when @TomazErjavec loads it into noSketch, because I am checking just the sample without seeing the whole corpus.
<taxonomy xml:id="meeting.parts"> if not used
There are still taxonomies that are not used in your corpus, I guess. Can you please run factorization, which extracts all taxonomies into separate files:
# factorize taxonomies:
make factorize-teiHeader-INPLACE-SE
# add new files into repository (taxonomies and list of persons and organizations)
git add Data/ParlaMint-SE/ParlaMint-SE-taxonomy-*.xml
git add Data/ParlaMint-SE/ParlaMint-taxonomy-*.xml
git add Data/ParlaMint-SE/ParlaMint-SE-list*.xml
div/[@type="commentSection"]
<note xml:id="i-Sd8foAAkXywxAbqQKr4Ykt">( Applåder )</note>
I was not able to find a different type of incident, so I hope there is none.
I don't know what model you are using because your application description doesn't mention it explicitly (proper name, version). But I believe that your model supports multi-token named entities, so this:
<name type="PER">
<w msd="UPosTag=PROPN|Case=Nom" lemma="Morgan" xml:id="i-4Vkv8ELR4zHptJa1VWJFWk">Morgan</w>
</name>
<name type="PER">
<w msd="UPosTag=PROPN|Case=Nom" lemma="Johansson" xml:id="i-4VkvE4W2w7McRCjr7bQBvC">Johansson</w>
</name>
should be
<name type="PER">
<w msd="UPosTag=PROPN|Case=Nom" lemma="Morgan" xml:id="i-4Vkv8ELR4zHptJa1VWJFWk">Morgan</w>
<w msd="UPosTag=PROPN|Case=Nom" lemma="Johansson" xml:id="i-4VkvE4W2w7McRCjr7bQBvC">Johansson</w>
</name>
We are not insisting on subtitle #480
https://clarin-eric.github.io/ParlaMint/#sec-titleStmt
The title statement starts with two titles (one main, the other subordinate), both in English and the local language, with the appropriate language code possibly inherited from a superordinate element. They are distinguished by the value main or sub of their type attribute and the value of their xml:lang attribute. In the example it can be seen that the main title of a corpus component is simply an extension of the corpus root title, as it also gives the name of the particular meeting that the component contains, while the subordinate title is, again, free text. Both titles must be unique in the complete corpus.
Status 2022-11-28
WONTFIX: our NER tool does not detect multi-token entities. We're already using a backup as the primary one does not work.
Well, this is sad. Swedish is hardly a less resourced language, I just checked with Mr. Google, and there are a lot of NER tools for Swedish, so I can't help wondering why you would use a crippled one... But if you are happy with Swedish having different and less useful NEs from all the rest of the corpora, then on your head be it!
why you would use a crippled one
- hfst-SweNER does not seem to be maintained anymore, and we don't have the time to debug the python2 code that's breaking
- BERT/huggingface NER is easy to integrate to our python scripts, but while accurate is limited in features
- Anything else would need more time to integrate into our codebase, and as mentioned, we don't have that
BERT/huggingface NER is easy to integrate to our python scripts, but while accurate is limited in features
Yes, this one seemed the most promising to me. I don't know what you mean by "limited in features", but I have problems imagining it is worse than having individual words as names.
Of course, there is another way, i.e. to join n successive names into one, at least in case they have the same class.
You mean to hack together something post-hoc? I mean that's possible but I can come up with situations where that fails.
You mean to hack together something post-hoc?
Yes.
I mean that's possible but I can come up with situations where that fails.
Not sure why: if they both have the same class, then just merge the two names; if not, leave them apart (unless there is some nice regularity that you would observe, but this could be overdoing it). I imagine most would be two PER, and PER is also the most useful for further analysis (who mentions whom).
Well, then I'll implement that heuristic. Let's hope the edge cases that break it are few and far between.
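The heuristic discussed above could be sketched roughly as follows. This is an illustration only, not the code that was actually used: the function names are hypothetical, and it assumes that each name element is a direct sibling of the next one, with nothing (such as a pc token) in between.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a namespace such as {http://www.tei-c.org/ns/1.0} from a tag."""
    return tag.rsplit("}", 1)[-1] if isinstance(tag, str) else ""


def merge_adjacent_names(root):
    """Merge runs of consecutive <name> siblings that share the same @type."""
    for parent in list(root.iter()):
        previous = None
        for child in list(parent):
            same_class = (
                previous is not None
                and local_name(child.tag) == "name"
                and local_name(previous.tag) == "name"
                and child.get("type") is not None
                and child.get("type") == previous.get("type")
            )
            if same_class:
                previous.extend(list(child))  # move the tokens into the earlier <name>
                previous.tail = child.tail
                parent.remove(child)
            else:
                previous = child
    return root
```

One case where this presumably goes wrong is two distinct entities of the same class standing next to each other with no punctuation between them; they would be merged into a single name.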
Status 2022-11-29
@matyaskopp @TomazErjavec Status 2022-12-02
Here is a link to the files:
I take it this is supposed to be the full TEI (but not .ana) encoded version? If so:
I corrected the XIncludes locally, so I could try the finalization step, which does have some errors and warnings, cf. https://nl.ijs.si/et/tmp/ParlaMint/Repo/ParlaMint-SE.log and do "grep -i error" and "grep -i warning"
@TomazErjavec there are no errors when I download the log file?
Yup, there are none, very nice. This was stock advice. So, just do $ grep -i warning
Unique warnings
WARN ParlaMint-SE_2019-11-07-prot-201920--28: fixing subcorpus to covid for date 2019-11-07
WARN ParlaMint-SE_2019-11-08-prot-201920--29: fixing subcorpus to covid for date 2019-11-08
WARN ParlaMint-SE_2019-11-12-prot-201920--30: fixing subcorpus to covid for date 2019-11-12
WARN ParlaMint-SE_2019-11-13-prot-201920--31: fixing subcorpus to covid for date 2019-11-13
WARN ParlaMint-SE_2019-11-14-prot-201920--32: fixing subcorpus to covid for date 2019-11-14
WARN ParlaMint-SE_2019-11-15-prot-201920--33: fixing subcorpus to covid for date 2019-11-15
WARN ParlaMint-SE_2019-11-19-prot-201920--34: fixing subcorpus to covid for date 2019-11-19
WARN ParlaMint-SE_2019-11-20-prot-201920--35: fixing subcorpus to covid for date 2019-11-20
WARN ParlaMint-SE_2019-11-21-prot-201920--36: fixing subcorpus to covid for date 2019-11-21
WARN ParlaMint-SE_2019-11-22-prot-201920--37: fixing subcorpus to covid for date 2019-11-22
WARN ParlaMint-SE_2019-11-26-prot-201920--38: fixing subcorpus to covid for date 2019-11-26
WARN ParlaMint-SE_2019-11-27-prot-201920--39: fixing subcorpus to covid for date 2019-11-27
WARN ParlaMint-SE_2019-11-28-prot-201920--40: fixing subcorpus to covid for date 2019-11-28
WARN ParlaMint-SE_2019-11-29-prot-201920--41: fixing subcorpus to covid for date 2019-11-29
WARN ParlaMint-SE_2019-12-02-prot-201920--42: fixing subcorpus to covid for date 2019-12-02
WARN ParlaMint-SE_2019-12-03-prot-201920--43: fixing subcorpus to covid for date 2019-12-03
WARN ParlaMint-SE_2019-12-04-prot-201920--44: fixing subcorpus to covid for date 2019-12-04
WARN ParlaMint-SE_2019-12-05-prot-201920--45: fixing subcorpus to covid for date 2019-12-05
WARN ParlaMint-SE_2019-12-06-prot-201920--46: fixing subcorpus to covid for date 2019-12-06
WARN ParlaMint-SE_2019-12-09-prot-201920--47: fixing subcorpus to covid for date 2019-12-09
WARN ParlaMint-SE_2019-12-10-prot-201920--48: fixing subcorpus to covid for date 2019-12-10
WARN ParlaMint-SE_2019-12-11-prot-201920--49: fixing subcorpus to covid for date 2019-12-11
WARN ParlaMint-SE_2019-12-12-prot-201920--50: fixing subcorpus to covid for date 2019-12-12
WARN ParlaMint-SE_2019-12-13-prot-201920--51: fixing subcorpus to covid for date 2019-12-13
WARN ParlaMint-SE_2019-12-16-prot-201920--52: fixing subcorpus to covid for date 2019-12-16
WARN ParlaMint-SE_2019-12-17-prot-201920--53: fixing subcorpus to covid for date 2019-12-17
WARN ParlaMint-SE_2019-12-18-prot-201920--54: fixing subcorpus to covid for date 2019-12-18
WARN ParlaMint-SE_2019-12-19-prot-201920--55: fixing subcorpus to covid for date 2019-12-19
WARN ParlaMint-SE_2019-12-20-prot-201920--56: fixing subcorpus to covid for date 2019-12-20
WARN: /project/corpora/Parla/ParlaMint/V3/Data/ParlaMint-SE.TEI/ParlaMint-SE-listOrg.xml not found
WARN: /project/corpora/Parla/ParlaMint/V3/Data/ParlaMint-SE.TEI/ParlaMint-SE-listPerson.xml not found
WARN: No .ana files for SE samples
WARN: No ana root file, skipping
WARN: party without proper name Q10585380
WARN: party without proper name Q3360009
WARN: party without proper name Q50383811
WARN: party without proper name Q61791721
WARN: short date 2006-05
WARN: short date 2016-10
AFAIK this is automatically fixed, @TomazErjavec confirm?
Yes.
@matyaskopp @TomazErjavec Status 2022-12-06
Here is a link to the files:
@ninpnin great, can you update the sample on github, please?
And if possible factorize tei header:
# factorize taxonomies:
make factorize-teiHeader-INPLACE-SE
# add new files into repository (taxonomies and list of persons and organizations)
git add Data/ParlaMint-SE/ParlaMint-SE-taxonomy-*.xml
git add Data/ParlaMint-SE/ParlaMint-taxonomy-*.xml
git add Data/ParlaMint-SE/ParlaMint-SE-list*.xml
@matyaskopp the sample is now updated. Where do you want the factorized files?
component filenames
ParlaMint-SE_YYYY-MM-DD<suffix without '_'>.xml
/TEI/@xml:id
can you please rename component files according to the recommendations: 2.3. File names and directory structure
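A quick way to check component file names against the pattern quoted above might look like this. It is a sketch under assumptions: COMPONENT_NAME and is_valid_component_name are hypothetical names, and the regular expression only encodes the pattern as written in this thread (a full date, then a suffix without an underscore), not the complete recommendations.

```python
import re
from datetime import date

# ParlaMint-SE_YYYY-MM-DD<suffix without '_'>.xml, as quoted in the thread.
COMPONENT_NAME = re.compile(r"^ParlaMint-SE_(\d{4}-\d{2}-\d{2})[^_/]*\.xml$")


def is_valid_component_name(filename):
    """Check the name shape and that the date part is a real calendar date."""
    match = COMPONENT_NAME.match(filename)
    if match is None:
        return False
    try:
        date.fromisoformat(match.group(1))
    except ValueError:
        return False
    return True


print(is_valid_component_name("ParlaMint-SE_2015-10-23-prot-201516--19.xml"))
print(is_valid_component_name("ParlaMint-SE_201516--19.xml"))
```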
wrong meeting text content
teiCorpus//meeting/text()
https://github.com/ninpnin/ParlaMint/blob/6b36bd4952aecde322bbd4c52ad4eed7a73e6ec3/Data/ParlaMint-SE/ParlaMint-SE.xml#L8-L10
missing Swedish translations in taxonomies
remove unused taxonomies
taxonomy[@xml:id="parla.links"]
I guess you can remove this taxonomy, it was used in CZ corpus and it seems that you don't use it. https://github.com/ninpnin/ParlaMint/blob/6b36bd4952aecde322bbd4c52ad4eed7a73e6ec3/Data/ParlaMint-SE/ParlaMint-SE.xml#L265-L279
wrong date in corpus root setting
setting/date element value
setting/date/@ana
setting/date should contain the timespan of the corpus (from-to), and if you want to add an @ana attribute, it should contain a list of terms https://github.com/ninpnin/ParlaMint/blob/6b36bd4952aecde322bbd4c52ad4eed7a73e6ec3/Data/ParlaMint-SE/ParlaMint-SE.xml#L316
Weird event label
https://github.com/ninpnin/ParlaMint/blob/6b36bd4952aecde322bbd4c52ad4eed7a73e6ec3/Data/ParlaMint-SE/ParlaMint-SE.xml#L324-L333
invalid date in parliament organization
from should start before to.
Thanks for this bug. It seems that our validation is not paranoid enough. (@matyaskopp, extend validation)
https://github.com/ninpnin/ParlaMint/blob/6b36bd4952aecde322bbd4c52ad4eed7a73e6ec3/Data/ParlaMint-SE/ParlaMint-SE.xml#L328
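An extra check of this kind could be sketched as below. This is not the project's actual validation; reversed_spans is a hypothetical name, and the sketch assumes that @from and @to hold ISO 8601 dates of the same granularity, so that plain string comparison orders them correctly. Equal values are not reported, only a from that is later than its to.

```python
import xml.etree.ElementTree as ET


def reversed_spans(root):
    """Return (tag, from, to) for every element whose @from is later than its @to."""
    problems = []
    for element in root.iter():
        start, end = element.get("from"), element.get("to")
        # Assumption: both values are ISO 8601 dates (e.g. YYYY-MM-DD),
        # which sort chronologically when compared as strings.
        if start and end and start > end:
            tag = element.tag.rsplit("}", 1)[-1]  # drop any namespace prefix
            problems.append((tag, start, end))
    return problems
```

Called on the parsed corpus root, for example reversed_spans(ET.parse(path).getroot()) with path pointing at the root file, an empty list would mean no reversed spans were found.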
missing term in parliament organization
There should be three terms in parliament organization. Expecting it owing to:
missing opposition relation
Do you have opposition in the Swedish parliament?
split forename
if someone has multiple names, each should have its own element https://github.com/ninpnin/ParlaMint/blob/6b36bd4952aecde322bbd4c52ad4eed7a73e6ec3/Data/ParlaMint-SE/ParlaMint-SE.ana.xml#L602
should be
component file meeting
The meeting element in the component file should specify the content of the file (e.g. use parla.sitting if it contains a sitting day) https://github.com/ninpnin/ParlaMint/blob/6b36bd4952aecde322bbd4c52ad4eed7a73e6ec3/Data/ParlaMint-SE/ParlaMint-SE_201516--19.xml#L8
CZ sample: https://github.com/clarin-eric/ParlaMint/blob/47a6a842d5a6447266f3ce0d95ad83bdac66673e/Data/ParlaMint-CZ/ParlaMint-CZ_2016-04-13-ps2013-044-02-013-114.xml#L13-L16
debates beginning
It is possible that I don't understand it. Sittings in your data start with a weird sequence of unknown speakers and notes. @TomazErjavec can you help me with the feedback here? https://github.com/ninpnin/ParlaMint/blob/6b36bd4952aecde322bbd4c52ad4eed7a73e6ec3/Data/ParlaMint-SE/ParlaMint-SE_202122--29.xml#L193
Some notes look similar to some notes...
and even the linguistic annotation is weird for these situations:
missing chairperson
chair role.
speeches split by paragraphs
You are starting a new utterance whenever a new paragraph starts. There is no speaker change... https://github.com/ninpnin/ParlaMint/blob/6b36bd4952aecde322bbd4c52ad4eed7a73e6ec3/Data/ParlaMint-SE/ParlaMint-SE_201516--19.xml#L140-L149
I don't understand the usage of @next (referring to the following speech - not u) and @prev (referring to the first element u of a sequence of u elements that creates one speech)