clarin-eric / ParlaMint

ParlaMint: Comparable Parliamentary Corpora
https://clarin-eric.github.io/ParlaMint/

BE feedback #496

Open matyaskopp opened 2 years ago

matyaskopp commented 2 years ago

I have just a few observations:

Responsibility for linguistic annotations in the TEI version

https://github.com/JessedeDoes/ParlaMint/blob/1f0a9d3ef52e8a2aad8b3733dc1cc742bce4f0fe/Data/ParlaMint-BE/ParlaMint-BE.xml#L17-L21

                <respStmt>
                    <persName>Jesse de Does</persName>
                    <resp xml:lang="nl">Taalkundige verrijking</resp>
                    <resp xml:lang="en">Linguistic annotation</resp>
                </respStmt>

Wrong date

https://github.com/JessedeDoes/ParlaMint/blob/1f0a9d3ef52e8a2aad8b3733dc1cc742bce4f0fe/Data/ParlaMint-BE/ParlaMint-BE.xml#L70

<date from="2015-11-12" to="2022-07-13">2015-11-12 - 022-07-13</date>
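
A mismatch like this one can be caught mechanically. The following is a minimal sketch, assuming the element text is expected to repeat the `from` and `to` values as two full ISO dates; the function name is hypothetical and not part of the ParlaMint tooling.

```python
import re
import xml.etree.ElementTree as ET

# A full YYYY-MM-DD token; a truncated year such as "022-07-13" does not match.
ISO_DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")

def date_text_matches_attributes(date_xml: str) -> bool:
    """Return True if the text of a <date> repeats its @from and @to values."""
    elem = ET.fromstring(date_xml)
    found = ISO_DATE.findall(elem.text or "")
    return found == [elem.get("from"), elem.get("to")]
```

Applied to the element quoted above, the check fails because only one complete date is found in the text.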

Taxonomy fusion

You have introduced some new taxonomies and modified some of the common ones. This needs to be unified in v3.1. E.g., you used new categories in parla.legislature:

<category xml:id="parla.federal">
  <!-- toegevoegd -->
  <catDesc xml:lang="nl">
    <term>Federaal</term>
  </catDesc>
  <catDesc xml:lang="en">
    <term>Federal</term>
  </catDesc>
</category>

You can check the CZ folder to see how the common taxonomies should look.

wrong idno type

please follow the recommendation here: https://clarin-eric.github.io/ParlaMint/#TEI.idno

https://github.com/JessedeDoes/ParlaMint/blob/1f0a9d3ef52e8a2aad8b3733dc1cc742bce4f0fe/Data/ParlaMint-BE/ParlaMint-BE.xml#L408

<idno type="wikimedia" xml:lang="nl">https://nl.wikipedia.org/wiki/Federaal_Parlement_van_Belgi%C3%AB</idno>

should be

<idno type="URI" subtype="wikimedia" xml:lang="nl">https://nl.wikipedia.org/wiki/Federaal_Parlement_van_Belgi%C3%AB</idno>
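
The rewrite from the first form to the second could be scripted roughly as follows. This is only a sketch: the set of legacy `type` values is an assumption (only "wikimedia" appears in this issue), and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumption: these old @type values should become the @subtype of a URI identifier.
LEGACY_TYPES = {"wikimedia"}

def normalise_idno(idno_xml: str) -> str:
    """Turn <idno type="wikimedia"> into <idno type="URI" subtype="wikimedia">."""
    elem = ET.fromstring(idno_xml)
    old_type = elem.get("type")
    if old_type in LEGACY_TYPES:
        elem.set("type", "URI")
        elem.set("subtype", old_type)
    # Other attributes (such as xml:lang) and the text are left untouched.
    return ET.tostring(elem, encoding="unicode")
```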

settingDesc date in corpus root files

The date should cover the full corpus period: https://github.com/JessedeDoes/ParlaMint/blob/1f0a9d3ef52e8a2aad8b3733dc1cc742bce4f0fe/Data/ParlaMint-BE/ParlaMint-BE.xml#L392

            <settingDesc>
                <setting>
                    <name type="city">Brussel</name>
                    <name key="BE" type="country">België</name>
<!-- MISSING from and to -->
                    <date ana="#parla.sitting" when="2016-05-26">2016-05-26</date>
                </setting>
            </settingDesc>

speaker note before speech

It is common to have a speaker note before a speech - it is not a part of the speech. https://github.com/JessedeDoes/ParlaMint/blob/1f0a9d3ef52e8a2aad8b3733dc1cc742bce4f0fe/Data/ParlaMint-BE/ParlaMint-BE_2021-09-22-definitief-55-commissie-ic577x.xml#L108

<u ana="#regular" xml:id="ParlaMint-BE_2021-09-22-definitief-55-commissie-ic577x.u1" who="#BoukiliNabil" xml:lang="nl">
    <note>01.01 Nabil Boukili (PVDA-PTB):</note>

should be

<note type="speaker">01.01 Nabil Boukili (PVDA-PTB):</note>
<u ana="#regular" xml:id="ParlaMint-BE_2021-09-22-definitief-55-commissie-ic577x.u1" who="#BoukiliNabil" xml:lang="nl">
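
A possible way to apply this change in bulk is sketched here. It assumes that a `<note>` which is the very first child of a `<u>`, with no text before it, is always the speaker label, and it works on un-namespaced fragments (real files use the TEI namespace, so the tag names would need adjusting). The function name is hypothetical.

```python
import xml.etree.ElementTree as ET

def hoist_speaker_notes(parent: ET.Element) -> None:
    """Move a <note> that opens a <u> to just before that <u>, typed as speaker."""
    for u in list(parent.findall("u")):
        children = list(u)
        # Assumption: a leading <note> with no preceding text is the speaker label.
        if children and children[0].tag == "note" and not (u.text or "").strip():
            note = children[0]
            u.remove(note)
            u.text = note.tail   # text that followed the note stays inside <u>
            note.tail = None
            note.set("type", "speaker")
            parent.insert(list(parent).index(u), note)
```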

missing parts of transcriptions

https://github.com/JessedeDoes/ParlaMint/blob/1f0a9d3ef52e8a2aad8b3733dc1cc742bce4f0fe/Data/ParlaMint-BE/ParlaMint-BE_2021-09-22-definitief-55-commissie-ic577x.xml#L494-L497

<u ana="#regular" xml:id="ParlaMint-BE_2021-09-22-definitief-55-commissie-ic577x.u35" who="#VanVaerenberghKristien" xml:lang="nl">
  <note xml:lang="nl">07.02 Kristien Van Vaerenbergh (N-VA):</note>
  <seg xml:id="ParlaMint-BE_2021-09-22-definitief-55-commissie-ic577x.seg286" xml:lang="nl">Reeds enige tijd is er een groot aantal vacante plaatsen voor de functie van vrederechter op de Brusselse vredegerechten. In uw beleidsverklaring sprak u van extra investeringen in Justitie onder andere op gebied van informatica en het aanwerven van meer personeel.</seg>
</u>

missing notes

There are a lot of notes like this:

Het incident is gesloten. L'incident est clos. De openbare commissievergadering wordt gesloten om 17.19 uur. La réunion publique de commission est levée à 17 h 19.

These are missing from the component files.

JessedeDoes commented 1 year ago

First the easy ones:

Cf the next comments for the more complex issues.

JessedeDoes commented 1 year ago
JessedeDoes commented 1 year ago

We have tried to do this as much as possible now. When categories we need are missing from the common taxonomy, we add a -BE file with the supplementary categories. In some cases, this could be merged into the common ontology. BTW The common taxonomy directory in the repository does not seem to contain all necessary files. I took the CZ file in these cases.

matyaskopp commented 1 year ago
  • missing text: this has to do with text paragraphs which could not be classified automatically in the first step of the conversion from HTML to TEI. In the first stage, these paragraphs are kept as <p>. The ParlaMint scheme does not allow this, so they were simply removed in the last cleaning step to make the files ParlaMint-conformant. In the current version, we keep this content as gaps:
                <gap reason="editorial">
                    <desc>Content could not be classified: 02 Question de Marianne Verhaert à Mathieu Michel (Digitalisation, Simplification administrative, Protection de la vie privée et Régie des Bâtiments) sur &quot;Le portefeuille électronique&quot; (55024726C)</desc>
                </gap>

I am not satisfied with this encoding, but I do not see a satisfactory alternative without altering the ParlaMint scheme.

Does this mean that you are unsure if it is an utterance <u> or stenographer's notes <note>? I believe this should be a note if you are not sure.

  • Using common taxonomies.

We have tried to do this as much as possible now. When categories we need are missing from the common taxonomy, we add a -BE file with the supplementary categories. In some cases, this could be merged into the common ontology. BTW The common taxonomy directory in the repository does not seem to contain all necessary files. I took the CZ file in these cases.

Yes, taking the CZ taxonomy is OK. But for the UD-SYN taxonomy, it is better to use this one: https://github.com/clarin-eric/ParlaMint/blob/main/Data/Taxonomies/ParlaMint-taxonomy-UD-SYN.ana.xml This taxonomy is automatically generated from the UD documentation and contains all documented relations (even for languages that are not in ParlaMint): https://github.com/clarin-eric/ParlaMint/blob/5deaeed5ae792f3ba1726072298885b5b64a6d64/Makefile#L589-L594

  • In ParlaMint-taxonomy-parla.legislature.xml, we would need categories parla.community, parla.federal, and parla.meeting.committee. These are defined in ParlaMint-BE-taxonomy-parla.legislature.xml

parla taxonomy is meant for parliament plenary speech transcription classification. If you are encoding committee meetings, then you can use your own taxonomy. @TomazErjavec, do we agree on that?

As for the parla.federal category, I think that it should be parla.national, and the separate parliaments within the federation should be parla.regional, so you don't need a parla.federal category.

  • The common taxonomy for speaker types (taken from CZ) is quite limited. Hence we added ParlaMint-BE-taxonomy-speaker_types.xml

Yes, the taxonomy is limited, but it is as it is defined in ParlaMint.

If you want to extend this taxonomy, I guess you should create a new one, as you did. But if a minister is speaking, then you should use both taxonomies (I hope this will not break @TomazErjavec's script):

<u ana="#regular #minister" ...>

But remember that this categorization is speaker categorization, so if someone holds a minister position, it does not necessarily mean that he is speaking as a minister (not a regular MP) - in CZ, we are not able to distinguish this from the transcription.

The Dutch and French UD tagging contains relations not in the common UD taxonomy (mostly from French GSD). They are declared in ParlaMint-BE-taxonomy-UD-SYN.ana.xml. This could be added to the common ParlaMint-taxonomy-UD-SYN.ana.xml?

  • The extra relations (except for goeswith and dislocated) are extensions of existing ones. Maybe an encoding like <link target="#ParlaMint-BE_2022-07-07-voorlopig-55-plenair-ip193x.w1061 #ParlaMint-BE_2022-07-07-voorlopig-55-plenair-ip193x.w1065" ana="ud-syn:obl_mod"/> could become something like <link target="#ParlaMint-BE_2022-07-07-voorlopig-55-plenair-ip193x.w1061 #ParlaMint-BE_2022-07-07-voorlopig-55-plenair-ip193x.w1065" ana="ud-syn:obl ud-syn-gsd:obl_mod"/> to distinguish common base layer and extension to make the encoding more interoperable?

The common taxonomy at https://github.com/clarin-eric/ParlaMint/blob/main/Data/Taxonomies/ParlaMint-taxonomy-UD-SYN.ana.xml covers these situations: https://github.com/clarin-eric/ParlaMint/blob/5deaeed5ae792f3ba1726072298885b5b64a6d64/Data/Taxonomies/ParlaMint-taxonomy-UD-SYN.ana.xml#L1210-L1213

TomazErjavec commented 1 year ago

I believe this should be a note if you are not sure.

I agree, <note type="editorial"> is better than <gap type="editorial"> .
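
Converting the existing gaps into notes could look roughly like this. It is a sketch under two assumptions: the gaps have exactly the shape quoted earlier in the thread (`<gap reason="editorial">` with a single `<desc>`), and the fragment is un-namespaced. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET

def gaps_to_notes(root: ET.Element) -> int:
    """Rewrite <gap reason="editorial"><desc>T</desc></gap> as <note type="editorial">T</note>."""
    converted = 0
    # Materialise the list first, because elements are renamed while looping.
    for gap in list(root.iter("gap")):
        desc = gap.find("desc")
        if gap.get("reason") != "editorial" or desc is None:
            continue
        gap.remove(desc)
        gap.tag = "note"
        del gap.attrib["reason"]
        gap.set("type", "editorial")
        gap.text = desc.text  # the description text becomes the note content
        converted += 1
    return converted
```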

parla taxonomy is meant for parliament plenary speech transcription classification. If you are encoding committee meetings, then you can use your own taxonomy. @TomazErjavec, do we agree on that?

I don't think so. If we are talking about ParlaMint-taxonomy-parla.legislature(.xml), that one contains much more than just plenary speech transcription classification. I would be in favour of adding BE categories to the common taxonomy, as long as they are nicely positioned in it. I haven't had a look at your taxonomy yet, though. @matyaskopp, do you see a problem here?

matyaskopp commented 1 year ago

I don't think so. If we are talking about ParlaMint-taxonomy-parla.legislature(.xml), that one contains much more than just plenary speech transcription classification. I would be in favour of adding BE categories to the common taxonomy, as long as they are nicely positioned in it. I haven't had a look at your taxonomy yet, though. @matyaskopp, do you see a problem here?

I can imagine that we can extend parla.legislature taxonomy. But in fact, I think new BE categories are breaking it a bit. The taxonomy describes multiple points of view:

parla.meeting.committee

So a category parla.meeting.committee would be a mixture of the temporal and the organization-type points of view. Such a meeting should instead carry two categories: parla.meeting (temporal) and parla.committee (organization type).

I do not see a reason for adding parla.meeting.committee because it is a kind of hybrid category.

parla.community

We don't have "Flemish or Wallonian community" in CZ. What is this category for?

parla.federal

I think this can be replaced with parla.national, or we can add this category between parla.supranational and parla.national. I think it is more like a "province" point of view (not an organization type).

JessedeDoes commented 1 year ago

Multiple speaker types indeed break the validation:

 Error: Type error on line 332 column 49 of parlamint-lib.xsl:
    XTTE0780  A sequence of more than one item is not allowed as the result of a call to
    et:u-role#1 ("Prime Minister", "Regular") 

We interpreted 'regular' as "speaking as a member of parliament". If a person holds a minister post at the time of speaking, he/she is not speaking as a member of parliament.

But I can map our current speaker types to ParlaMint, assuming that parliament members, ministers, prime ministers and secretaries of state are "regulars", and the rest are "guest" (incidental speakers)?

JessedeDoes commented 1 year ago

Summarizing:

matyaskopp commented 1 year ago

But I can map our current speaker types to ParlaMint, assuming that parliament members, ministers, prime ministers and secretaries of state are "regulars", and the rest are "guest" (incidental speakers)?

Yes, that will probably be best. We need all corpora to be comparable... Thanks

TomazErjavec commented 1 year ago

I agree with @matyaskopp: all speakers are regular speakers (MPs, ministers, the prime minister), except invited guests, who are not affiliated with the parliament or government. Adding "#minister" would be redundant anyway: we know that somebody is a minister from their affiliation, by checking the affiliation's from and to dates against the date on which the person is speaking.
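
The mapping agreed on here can be sketched as a simple lookup. The BE speaker-type labels below are hypothetical placeholders (the real values are defined in ParlaMint-BE-taxonomy-speaker_types.xml), and any other common speaker categories not named in this thread are not handled.

```python
# Hypothetical BE labels for the speaker types the thread treats as "regular".
REGULAR_TYPES = {
    "member_of_parliament",
    "minister",
    "prime_minister",
    "secretary_of_state",
}

def parlamint_role(be_type: str) -> str:
    """Map a BE-specific speaker type to a single common pointer for u/@ana."""
    # Everything outside the regular set is treated as an incidental (guest) speaker.
    return "#regular" if be_type in REGULAR_TYPES else "#guest"
```

Returning exactly one pointer per utterance also avoids the XTTE0780 validation error reported above, which was triggered by two speaker types on the same `<u>`.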

Gap becomes note for the unclassified paragraphs. Some way to characterize this content would be welcome. Maybe allow subtype="problematic_content" or something along those lines?

I think @type="editorial" covers this anyway to an extent (why would the editor put something into a note unless it was problematic). It would not be much work to add @subtype to note, but I am a bit disinclined to do so, if only BE would be using it (while others have similar cases, which they treat as note/@type="editorial").

As for the taxonomy, I would need to find some quality time to understand the whole thing, which I can't seem to find, sigh. Maybe the weekend...

matyaskopp commented 1 year ago

invalid url format

https://github.com/JessedeDoes/ParlaMint/blob/32213d529bbbb2b28ced35d2a7bfb74c2ba9edd1/Data/ParlaMint-BE/ParlaMint-BE.xml#L70-L78

                    <idno type="URI">https://www.dekamer.be/kvvcr/showpage.cfm?section=/cricra
          &amp;
          language=nl
          &amp;
          cfm=dcricra.cfm?type=plen
          &amp;
          cricra=cri
          &amp;
          count=all</idno>
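
URIs broken up like this can be detected and repaired with a small script. This is a sketch, assuming that whitespace never legitimately occurs inside an `<idno type="URI">` and that the fragment is un-namespaced; both function names are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

def malformed_uri_idnos(root: ET.Element) -> list[ET.Element]:
    """Return every <idno type="URI"> whose text contains whitespace."""
    return [
        idno for idno in root.iter("idno")
        if idno.get("type") == "URI" and re.search(r"\s", idno.text or "")
    ]

def collapse_uri(text: str) -> str:
    """Remove all whitespace, including the line breaks around '&'."""
    return re.sub(r"\s+", "", text)
```
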
matyaskopp commented 1 year ago

speeches misclassification

I still don't understand why so many speeches are misclassified. From my point of view (without knowing the language), the HTML classes, elements and other attributes can be used to classify them.

Describing this: https://github.com/JessedeDoes/ParlaMint/blob/32213d529bbbb2b28ced35d2a7bfb74c2ba9edd1/Data/ParlaMint-BE/ParlaMint-BE_2021-03-30-definitief-55-commissie-ic427x.xml#L174-L177 which corresponds to this place in the source: https://www.dekamer.be/doc/CCRI/html/55/ic427x.html#TN01

Regular/guest speeches start with a <p> with one of these classes: italFR, NormalNL, NormalFR (and probably italNL). Inside these <p>, there are:

There are also chairman speeches that do not follow the rules above, but you have correctly identified them.
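
As a rough illustration of the class-based approach, the sketch below only finds candidate paragraphs by their class attribute. It does not implement the conditions on the content inside these <p> elements, and the class list is taken from this comment, where italNL is only marked as probable. The class name SpeechStartFinder is hypothetical.

```python
from html.parser import HTMLParser

# Class names quoted in this comment; "italNL" is only marked as probable there.
SPEECH_CLASSES = {"italFR", "NormalNL", "NormalFR", "italNL"}

class SpeechStartFinder(HTMLParser):
    """Collect the class of every <p> whose class suggests the start of a speech."""

    def __init__(self):
        super().__init__()
        self.hits = []

    def handle_starttag(self, tag, attrs):
        if tag != "p":
            return
        # A class attribute may hold several space-separated names.
        classes = (dict(attrs).get("class") or "").split()
        self.hits.extend(c for c in classes if c in SPEECH_CLASSES)
```

Feeding the source HTML of a sitting to `SpeechStartFinder().feed(...)` and reading `hits` gives a first list of paragraphs to compare against the `<u>` elements in the converted file.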

notes do not contain xml:lang
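
Notes without an explicit language can be listed with a few lines of Python. This is a sketch on un-namespaced fragments; it only checks for an explicit xml:lang on the note itself and does not consider a value inherited from an ancestor. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def notes_without_lang(root: ET.Element) -> list[ET.Element]:
    """Return every <note> that carries no explicit xml:lang attribute."""
    return [note for note in root.iter("note") if XML_LANG not in note.attrib]
```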

strange xml directives

If you want to use <note type="editorial">, then please remove the <? ?> processing instruction, which is strange: https://github.com/JessedeDoes/ParlaMint/blob/32213d529bbbb2b28ced35d2a7bfb74c2ba9edd1/Data/ParlaMint-BE/ParlaMint-BE_2021-06-03-definitief-55-plenair-ip107x.xml#L401-L403

                <note type="editorial">
                    <?uncertain_content_classification Could be possibly be speaker text?>
                    L'incident est clos.
                </note>

But I would prefer not to use it at all, at least in the case above, which can be encoded better:

<note type="comment" xml:lang="fr">L'incident est clos.</note>

or:

<note type="narrative" xml:lang="fr">L'incident est clos.</note>
JessedeDoes commented 1 year ago
TomazErjavec commented 1 year ago

@JessedeDoes, in 77e8d95 I've added parla.meeting.committee to the general taxonomy. I'm not absolutely sure that the category belongs where I put it, but it might be good enough for now. So, could you copy the new category into your general ParlaMint-taxonomy-parla.legislature taxonomy and remove your additional taxonomy, please?

https://github.com/clarin-eric/ParlaMint/blob/77e8d952526ebcaf73f075c7a64b72071ae41be3/Data/Taxonomies/ParlaMint-taxonomy-parla.legislature.xml#L225-L233