Open matyaskopp opened 2 years ago
First the easy ones:
Cf the next comments for the more complex issues.
- missing text: this has to do with text paragraphs that could not be classified automatically in the first step of the conversion from HTML to TEI. In the first stage, these paragraphs are kept as <p>. The ParlaMint scheme does not allow this, so they were simply removed in the last cleaning step to make the files ParlaMint-conformant. In the current version, we keep this content as gaps:
<gap reason="editorial">
<desc>Content could not be classified: 02 Question de Marianne Verhaert à Mathieu Michel (Digitalisation, Simplification administrative, Protection de la vie privée et Régie des Bâtiments) sur "Le portefeuille électronique" (55024726C)</desc>
</gap>
I am not satisfied with this encoding, but I do not see a satisfactory alternative without altering the ParlaMint scheme.
Does this mean that you are unsure whether it is an utterance <u> or a stenographer's note <note>? I believe this should be a note if you are not sure.
- Using common taxonomies.
We have tried to do this as much as possible now. When categories we need are missing from the common taxonomy, we add a -BE file with the supplementary categories. In some cases, this could be merged into the common ontology. BTW The common taxonomy directory in the repository does not seem to contain all necessary files. I took the CZ file in these cases.
Yes, taking the CZ taxonomy is OK. But for the UD-SYN taxonomy, it is better to use this: https://github.com/clarin-eric/ParlaMint/blob/main/Data/Taxonomies/ParlaMint-taxonomy-UD-SYN.ana.xml This taxonomy is automatically generated from the UD documentation and contains all documented relations (even for languages that are not in ParlaMint): https://github.com/clarin-eric/ParlaMint/blob/5deaeed5ae792f3ba1726072298885b5b64a6d64/Makefile#L589-L594
- In ParlaMint-taxonomy-parla.legislature.xml, we would need categories parla.community, parla.federal, and parla.meeting.committee. These are defined in ParlaMint-BE-taxonomy-parla.legislature.xml
The parla taxonomy is meant for classifying transcriptions of parliamentary plenary speeches. If you are encoding committee meetings, then you can use your own taxonomy.
@TomazErjavec, do we agree on that?
As for the parla.federal category, I think that it should be parla.national, and the separate parliaments in a federation should be parla.regional, so you don't need a parla.federal category.
- The common taxonomy for speaker types (taken from CZ) is quite limited. Hence we added ParlaMint-BE-taxonomy-speaker_types.xml
Yes, the taxonomy is limited, but it is as it is defined in ParlaMint.
If you want to extend this taxonomy, I guess you should create a new one, as you did. But if a minister is speaking, then you should use both taxonomies. (I hope this will not break @TomazErjavec's script):
<u ana="#regular #minister" ...>
But remember that this categorization is speaker categorization, so if someone holds a minister position, it does not necessarily mean that he is speaking as a minister (not a regular MP) - in CZ, we are not able to distinguish this from the transcription.
The Dutch and French UD tagging contains relations not in the common UD taxonomy (mostly from French GSD). They are declared in ParlaMint-BE-taxonomy-UD-SYN.ana.xml. This could be added to the common ParlaMint-taxonomy-UD-SYN.ana.xml?
- The extra relations (except for goeswith and dislocated) are extensions of existing ones. Maybe an encoding like
<link target="#ParlaMint-BE_2022-07-07-voorlopig-55-plenair-ip193x.w1061 #ParlaMint-BE_2022-07-07-voorlopig-55-plenair-ip193x.w1065" ana="ud-syn:obl_mod"/>
could become something like
<link target="#ParlaMint-BE_2022-07-07-voorlopig-55-plenair-ip193x.w1061 #ParlaMint-BE_2022-07-07-voorlopig-55-plenair-ip193x.w1065" ana="ud-syn:obl ud-syn-gsd:obl_mod"/>
to distinguish the common base layer from the extension and make the encoding more interoperable?
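A minimal sketch of how the proposed two-layer ana value could be derived from the current one. It assumes that the part before the first underscore names the base relation (as in obl_mod) and that the extension prefix is ud-syn-gsd, as in the example; the function name split_relation is hypothetical and not part of any ParlaMint script.

```python
def split_relation(ana: str, ext_prefix: str = "ud-syn-gsd") -> str:
    """Rewrite 'prefix:base_subtype' as 'prefix:base ext_prefix:base_subtype'.

    Assumption: an underscore separates the base relation from its subtype.
    Values without a subtype (e.g. 'ud-syn:dep') are returned unchanged.
    """
    prefix, _, relation = ana.partition(":")
    base, sep, _subtype = relation.partition("_")
    if not sep:
        # No subtype part, so there is nothing to move to the extension layer.
        return ana
    return f"{prefix}:{base} {ext_prefix}:{relation}"


if __name__ == "__main__":
    print(split_relation("ud-syn:obl_mod"))
```

Relations that have no underscore, such as goeswith and dislocated, pass through untouched, so they would still need a separate decision.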
The https://github.com/clarin-eric/ParlaMint/blob/main/Data/Taxonomies/ParlaMint-taxonomy-UD-SYN.ana.xml taxonomy covers these situations: https://github.com/clarin-eric/ParlaMint/blob/5deaeed5ae792f3ba1726072298885b5b64a6d64/Data/Taxonomies/ParlaMint-taxonomy-UD-SYN.ana.xml#L1210-L1213
dep
I believe this should be a note if you are not sure.
I agree, <note type="editorial"> is better than <gap type="editorial">.
parla taxonomy is meant for parliament plenary speech transcription classification. If you are encoding committee meetings, then you can use your own taxonomy. @TomazErjavec, do we agree on that?
I don't think so. If we are talking about ParlaMint-taxonomy-parla.legislature(.xml), that one contains much more than just plenary speech transcription classification. I would be in favour of adding the BE categories to the common taxonomy, as long as they are nicely positioned in it. I haven't had a look at your taxonomy yet, though. @matyaskopp, do you see a problem here?
I can imagine that we can extend the parla.legislature taxonomy. But in fact, I think the new BE categories are breaking it a bit. The taxonomy describes multiple points of view:
parla.meeting.committee
If we have the category parla.meeting.committee, it is a mixture of temporal and organization type. This should be held by two categories: parla.meeting (temporal) and parla.committee (organization type).
I do not see a reason for adding parla.meeting.committee, because it is a kind of hybrid category.
parla.community
We don't have a "Flemish or Wallonian community" in CZ. What is this category for?
parla.federal
I think this can be replaced with parla.national, or we can add this category between parla.supranational and parla.national. I think it is more like a "province" point of view (not organization type).
Multiple speaker types indeed break the validation:
Error: Type error on line 332 column 49 of parlamint-lib.xsl:
XTTE0780 A sequence of more than one item is not allowed as the result of a call to
et:u-role#1 ("Prime Minister", "Regular")
We interpreted 'regular' as "speaking as a member of parliament". If a person holds a minister post at the time of speaking, he/she is not speaking as a member of parliament.
But I can map our current speaker types to parlamint, assuming that parliament members, ministers, prime ministers, secretaries of states are "regulars", and the rest are "guest" (incidental speakers)?
Summarizing:
Gap becomes note for the unclassified paragraphs. Some way to characterize this content would be welcome. Maybe allow subtype="problematic_content" or something along those lines?
But I can map our current speaker types to parlamint, assuming that parliament members, ministers, prime ministers, secretaries of states are "regulars", and the rest are "guest" (incidental speakers)?
Yes, it will probably be the best. We need all corpora to be comparable... Thank
I agree with @matyaskopp: all speakers are regular speakers (like MPs, ministers, the prime minister), except invited guests, who are not affiliated with the parliament or government. Adding "#minister" would be redundant anyway: we know somebody is a minister given their affiliation, by checking the affiliation's from and to dates against when the person is speaking.
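The mapping agreed on here can be sketched roughly as follows. The BE role labels in REGULAR_ROLES are assumptions for illustration (only "Prime Minister" and "Regular" appear in the validation error above), and the function name is hypothetical. Chair speeches are not covered by this sketch.

```python
# Hypothetical BE speaker-type labels that count as regular speakers.
REGULAR_ROLES = {
    "regular",
    "member of parliament",
    "minister",
    "prime minister",
    "secretary of state",
}


def to_parlamint_speaker_type(be_type: str) -> str:
    """Map a BE speaker type to a single ParlaMint category reference.

    Returns exactly one value, so u/@ana never carries two speaker types.
    """
    if be_type.strip().lower() in REGULAR_ROLES:
        return "#regular"
    # Everyone else is treated as an incidental (invited) speaker.
    return "#guest"


if __name__ == "__main__":
    for label in ("Prime Minister", "Invited expert"):
        print(label, "->", to_parlamint_speaker_type(label))
```

Returning a single category per speaker avoids the XTTE0780 error shown above, which was triggered by a sequence of two roles.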
Gap becomes note for the unclassified paragraphs. Some way to characterize this content would be welcome. Maybe allow subtype="problematic_content" or something along those lines?
I think @type="editorial" covers this anyway to an extent (why would the editor put something into a note unless it was problematic?). It would not be much work to add @subtype to note, but I am a bit disinclined to do so if only BE would be using it (while others have similar cases, which they treat as note/@type="editorial").
As for the taxonomy, I would need to find some quality time to understand the whole thing, which I can't seem to find, sigh. Maybe the weekend...
<idno type="URI">https://www.dekamer.be/kvvcr/showpage.cfm?section=/cricra&language=nl&cfm=dcricra.cfm?type=plen&cricra=cri&count=all</idno>
I still don't understand why so many speeches are misclassified. From my point of view (without knowledge of the language), the HTML classes, elements, and other attributes can be used.
Describing this: https://github.com/JessedeDoes/ParlaMint/blob/32213d529bbbb2b28ced35d2a7bfb74c2ba9edd1/Data/ParlaMint-BE/ParlaMint-BE_2021-03-30-definitief-55-commissie-ic427x.xml#L174-L177 which corresponds to this place in the source: https://www.dekamer.be/doc/CCRI/html/55/ic427x.html#TN01
regular/guest speeches start with a <p> with one of these classes: italFR, NormalNL, NormalFR (and probably italNL). Inside these <p>, there are:
- <a name="TN01"></a>, where TN01 is the speech number (you can use this anchor in the @source attribute - see the CZ <u> elements)
- <span class="oraspr">..., which contains the number of the speech ({topic}.{speech in topic}) and the speaker name(party):
- in following span
It looks like a speech ends when one of these follows:
- <p class=italNL><a name=T016></a><span lang=NL>Het incident is gesloten.</span></p>
- <p class=italFR><span lang=FR-BE>L'incident est clos.</span></p>
- <p class=MsoNormal>...</p>
- <p class=Titre2NL>...
so only the beginning of the meeting and a new topic before the first speech can contain unclassified notes, or you can classify them as <note type="comment">...</note>
There are also chairman speeches that do not follow the rules above, but you have correctly identified them.
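The start-of-speech rule described above could be sketched like this with Python's standard html.parser. This is only an illustration under the stated assumptions: a speech starts at a <p> with one of the four listed classes that contains a named <a> anchor followed by a <span class="oraspr">. The class name SpeechStartFinder is hypothetical, and chairman speeches and end-of-speech detection are not handled.

```python
from html.parser import HTMLParser

# Paragraph classes that, per the description above, can open a speech.
SPEECH_CLASSES = {"italFR", "italNL", "NormalNL", "NormalFR"}


class SpeechStartFinder(HTMLParser):
    """Collect (anchor name, paragraph class) for paragraphs that open a speech."""

    def __init__(self):
        super().__init__()
        self.starts = []
        self._p_class = None
        self._anchor = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "p":
            # Remember the class of the paragraph we are in.
            self._p_class = attrs.get("class")
            self._anchor = None
        elif tag == "a" and self._p_class in SPEECH_CLASSES and "name" in attrs:
            # Candidate speech anchor, e.g. TN01.
            self._anchor = attrs["name"]
        elif tag == "span" and attrs.get("class") == "oraspr" and self._anchor:
            # Anchor followed by the speech-number span: a speech starts here.
            self.starts.append((self._anchor, self._p_class))
            self._anchor = None

    def handle_endtag(self, tag):
        if tag == "p":
            self._p_class = None
            self._anchor = None


if __name__ == "__main__":
    finder = SpeechStartFinder()
    finder.feed('<p class=NormalNL><a name="TN01"></a>'
                '<span class="oraspr">01.01 ...</span></p>')
    print(finder.starts)
```

The collected anchor names could then be used to fill the @source attribute on the corresponding <u> elements.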
xml:lang
[ ] xml:lang in <note>: this is available in the source HTML
<? ?>
If you want to use <note type="editorial">, then please remove the <? ?>, which is strange:
https://github.com/JessedeDoes/ParlaMint/blob/32213d529bbbb2b28ced35d2a7bfb74c2ba9edd1/Data/ParlaMint-BE/ParlaMint-BE_2021-06-03-definitief-55-plenair-ip107x.xml#L401-L403
<note type="editorial">
<?uncertain_content_classification Could be possibly be speaker text?>
L'incident est clos.
</note>
But I prefer not to use it, at least in the case above, that can be encoded better:
<note type="comment" xml:lang="fr">L'incident est clos.</note>
or:
<note type="narrative" xml:lang="fr">L'incident est clos.</note>
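As a rough illustration, a note in the preferred form can be built with Python's xml.etree.ElementTree. The helper name make_note is hypothetical, and the TEI namespace is left out to keep the sketch short; a real conversion would put the element in the TEI namespace.

```python
import xml.etree.ElementTree as ET

# The reserved XML namespace; ElementTree serializes it with the "xml" prefix.
XML_NS = "http://www.w3.org/XML/1998/namespace"


def make_note(text: str, lang: str, note_type: str = "comment") -> ET.Element:
    """Build a <note> with @type and @xml:lang, without a processing instruction."""
    note = ET.Element("note", {"type": note_type, f"{{{XML_NS}}}lang": lang})
    note.text = text
    return note


if __name__ == "__main__":
    note = make_note("L'incident est clos.", "fr")
    print(ET.tostring(note, encoding="unicode"))
```

The language code would come from the lang attribute of the corresponding span in the source HTML (FR-BE or NL in the examples above), normalized as the corpus requires.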
<p> elements, so echoing it on the notes is not a problem
@JessedeDoes, in 77e8d95 I've added parla.meeting.committee to the general taxonomy. I'm not absolutely sure the category belongs where I put it, but it might be good enough for now. So, could you copy the new category into your general ParlaMint-taxonomy-parla.legislature taxonomy and remove your additional taxonomy, please?
I have just a few observations:
Responsibility for lingv. annotations in TEI version
https://github.com/JessedeDoes/ParlaMint/blob/1f0a9d3ef52e8a2aad8b3733dc1cc742bce4f0fe/Data/ParlaMint-BE/ParlaMint-BE.xml#L17-L21
Wrong date
https://github.com/JessedeDoes/ParlaMint/blob/1f0a9d3ef52e8a2aad8b3733dc1cc742bce4f0fe/Data/ParlaMint-BE/ParlaMint-BE.xml#L70
Taxonomy fusion
You have invented some new taxonomies, and some common ones are modified. This needs to be unified in v3.1. E.g., you used new categories in parla.legislature.
You can check the CZ folder for how the common taxonomies should look.
wrong idno type
please follow the recommendation here: https://clarin-eric.github.io/ParlaMint/#TEI.idno
https://github.com/JessedeDoes/ParlaMint/blob/1f0a9d3ef52e8a2aad8b3733dc1cc742bce4f0fe/Data/ParlaMint-BE/ParlaMint-BE.xml#L408
should be
settingDesc date in corpus root files
ana="#parla.sitting" from corpus root files
The date should contain the full corpus period: https://github.com/JessedeDoes/ParlaMint/blob/1f0a9d3ef52e8a2aad8b3733dc1cc742bce4f0fe/Data/ParlaMint-BE/ParlaMint-BE.xml#L392
speaker note before speech
type="speaker"
It is common to have a speaker note before a speech; it is not a part of the speech. https://github.com/JessedeDoes/ParlaMint/blob/1f0a9d3ef52e8a2aad8b3733dc1cc742bce4f0fe/Data/ParlaMint-BE/ParlaMint-BE_2021-09-22-definitief-55-commissie-ic577x.xml#L108
should be
missing parts of transcriptions
https://github.com/JessedeDoes/ParlaMint/blob/1f0a9d3ef52e8a2aad8b3733dc1cc742bce4f0fe/Data/ParlaMint-BE/ParlaMint-BE_2021-09-22-definitief-55-commissie-ic577x.xml#L494-L497
missing notes
There are a lot of notes like this:
Which is missing in component files