BE feedback - Githubissues

matyaskopp commented 2 years ago

I have just a few observations:

Responsibility for lingv. annotations in TEI version

[x] remove linguistic annotation responsibility from TEI version

https://github.com/JessedeDoes/ParlaMint/blob/1f0a9d3ef52e8a2aad8b3733dc1cc742bce4f0fe/Data/ParlaMint-BE/ParlaMint-BE.xml#L17-L21

                <respStmt>
                    <persName>Jesse de Does</persName>
                    <resp xml:lang="nl">Taalkundige verrijking</resp>
                    <resp xml:lang="en">Linguistic annotation</resp>
                </respStmt>

Wrong date

[x] fix date in text

https://github.com/JessedeDoes/ParlaMint/blob/1f0a9d3ef52e8a2aad8b3733dc1cc742bce4f0fe/Data/ParlaMint-BE/ParlaMint-BE.xml#L70

<date from="2015-11-12" to="2022-07-13">2015-11-12 - 022-07-13</date>
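A small consistency check could catch mismatches like the truncated year above. This is only a minimal sketch, not part of the ParlaMint tooling: the function name `date_text_matches` is hypothetical, and it assumes the element text is always written as `from - to`.

```python
import xml.etree.ElementTree as ET


def date_text_matches(date_xml: str) -> bool:
    """Return True if the text of a <date from=".." to=".."> element
    spells out exactly the two attribute values as "from - to"."""
    elem = ET.fromstring(date_xml)
    expected = f"{elem.get('from')} - {elem.get('to')}"
    return (elem.text or "").strip() == expected


broken = '<date from="2015-11-12" to="2022-07-13">2015-11-12 - 022-07-13</date>'
fixed = '<date from="2015-11-12" to="2022-07-13">2015-11-12 - 2022-07-13</date>'
print(date_text_matches(broken))  # False
print(date_text_matches(fixed))   # True
```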

Taxonomy fusion

[ ] use common taxonomies without modification, just add translations

You have invented some new taxonomies, and some common ones have been modified. This needs to be unified in v3.1. E.g., you used new categories in parla.legislature:

<category xml:id="parla.federal">
  <!-- toegevoegd -->
  <catDesc xml:lang="nl">
    <term>Federaal</term>
  </catDesc>
  <catDesc xml:lang="en">
    <term>Federal</term>
  </catDesc>
</category>

You can check CZ folder for how common taxonomies should look.

wrong idno type

[x] idno type

please follow the recommendation here: https://clarin-eric.github.io/ParlaMint/#TEI.idno

https://github.com/JessedeDoes/ParlaMint/blob/1f0a9d3ef52e8a2aad8b3733dc1cc742bce4f0fe/Data/ParlaMint-BE/ParlaMint-BE.xml#L408

<idno type="wikimedia" xml:lang="nl">https://nl.wikipedia.org/wiki/Federaal_Parlement_van_Belgi%C3%AB</idno>

should be

<idno type="URI" subtype="wikimedia" xml:lang="nl">https://nl.wikipedia.org/wiki/Federaal_Parlement_van_Belgi%C3%AB</idno>
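If many `<idno>` elements need this change, it could be scripted. The sketch below is an assumption about how one might do it with the Python standard library; `normalise_idno` is a hypothetical helper, and it only touches elements whose text looks like an HTTP(S) URL.

```python
import xml.etree.ElementTree as ET


def normalise_idno(idno: ET.Element) -> None:
    """Move a non-URI @type on an <idno> that holds a URL into @subtype."""
    value = idno.get("type")
    text = (idno.text or "").strip()
    if value and value != "URI" and text.startswith(("http://", "https://")):
        idno.set("subtype", value)
        idno.set("type", "URI")


idno = ET.fromstring(
    '<idno type="wikimedia" xml:lang="nl">'
    "https://nl.wikipedia.org/wiki/Federaal_Parlement_van_Belgi%C3%AB</idno>"
)
normalise_idno(idno)
print(idno.get("type"), idno.get("subtype"))  # URI wikimedia
```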

settingDesc date in corpus root files

[x] settingDesc date
[x] remove ana="#parla.sitting" from corpus root files

The date should contain full corpus period https://github.com/JessedeDoes/ParlaMint/blob/1f0a9d3ef52e8a2aad8b3733dc1cc742bce4f0fe/Data/ParlaMint-BE/ParlaMint-BE.xml#L392

            <settingDesc>
                <setting>
                    <name type="city">Brussel</name>
                    <name key="BE" type="country">België</name>
<!-- MISSING from and to -->
                    <date ana="#parla.sitting" when="2016-05-26">2016-05-26</date>
                </setting>
            </settingDesc>

speaker note before speech

[x] missing annotation type="speaker"
[x] move before speech

It is common to have a speaker note before a speech - it is not a part of the speech. https://github.com/JessedeDoes/ParlaMint/blob/1f0a9d3ef52e8a2aad8b3733dc1cc742bce4f0fe/Data/ParlaMint-BE/ParlaMint-BE_2021-09-22-definitief-55-commissie-ic577x.xml#L108

<u ana="#regular" xml:id="ParlaMint-BE_2021-09-22-definitief-55-commissie-ic577x.u1" who="#BoukiliNabil" xml:lang="nl">
    <note>01.01 Nabil Boukili (PVDA-PTB):</note>

should be

<note type="speaker">01.01 Nabil Boukili (PVDA-PTB):</note>
<u ana="#regular" xml:id="ParlaMint-BE_2021-09-22-definitief-55-commissie-ic577x.u1" who="#BoukiliNabil" xml:lang="nl">
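The move could be automated roughly as follows. This is a sketch under stated assumptions: `hoist_speaker_notes` is a hypothetical helper, the first `<note>` inside a `<u>` is assumed to be the speaker note, and the TEI namespace used by the real files is left out for brevity.

```python
import xml.etree.ElementTree as ET


def hoist_speaker_notes(parent: ET.Element) -> None:
    """For each <u> child that starts with a <note>, move that note in
    front of the <u> and mark it with type="speaker"."""
    for u in list(parent.findall("u")):
        # Skip utterances that are empty or have text before the first child.
        if len(u) == 0 or (u.text or "").strip():
            continue
        first = u[0]
        if first.tag != "note":
            continue
        u.remove(first)
        first.set("type", "speaker")
        first.tail = None
        # Insert the note at the position currently held by the <u>.
        parent.insert(list(parent).index(u), first)


div = ET.fromstring(
    '<div><u who="#BoukiliNabil">'
    "<note>01.01 Nabil Boukili (PVDA-PTB):</note><seg>...</seg>"
    "</u></div>"
)
hoist_speaker_notes(div)
print([child.tag for child in div])  # ['note', 'u']
```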

missing parts of transcriptions

[x] missing content?

https://github.com/JessedeDoes/ParlaMint/blob/1f0a9d3ef52e8a2aad8b3733dc1cc742bce4f0fe/Data/ParlaMint-BE/ParlaMint-BE_2021-09-22-definitief-55-commissie-ic577x.xml#L494-L497

<u ana="#regular" xml:id="ParlaMint-BE_2021-09-22-definitief-55-commissie-ic577x.u35" who="#VanVaerenberghKristien" xml:lang="nl">
  <note xml:lang="nl">07.02 Kristien Van Vaerenbergh (N-VA):</note>
  <seg xml:id="ParlaMint-BE_2021-09-22-definitief-55-commissie-ic577x.seg286" xml:lang="nl">Reeds enige tijd is er een groot aantal vacante plaatsen voor de functie van vrederechter op de Brusselse vredegerechten. In uw beleidsverklaring sprak u van extra investeringen in Justitie onder andere op gebied van informatica en het aanwerven van meer personeel.</seg>
</u>

missing notes

[x] missing notes

There are a lot of notes like this:

Het incident is gesloten. L'incident est clos. De openbare commissievergadering wordt gesloten om 17.19 uur. La réunion publique de commission est levée à 17 h 19.

These are missing from the component files.

JessedeDoes commented 1 year ago

First the easy ones:

We fixed the validation issue found by Tomaz in one of the files
We removed the resp statement for linguistic annotation from the annotated files
Wrong dates are corrected (also in settingDesc)
idno type is corrected
speaker note is moved to be before speech

Cf the next comments for the more complex issues.

JessedeDoes commented 1 year ago

missing text: this has to do with text paragraphs which could not automatically be classified in the first step of the conversion from HTML to TEI. In the first stage, these paragraphs are kept as . The ParlaMint scheme does not allow this, so they were just removed in the last cleaning step to make the files ParlaMint-conformant. In the current version, we keep this content as gaps:
```
 <gap reason="editorial">
 <desc>Content could not be classified: 02 Question de Marianne Verhaert à Mathieu Michel (Digitalisation, Simplification administrative, Protection de la vie privée et Régie des Bâtiments) sur &quot;Le portefeuille électronique&quot; (55024726C)</desc>
 </gap>
```
I am not satisfied with this encoding, but I do not see a satisfactory alternative without altering the ParlaMint scheme.

JessedeDoes commented 1 year ago

Using common taxonomies.

We have tried to do this as much as possible now. When categories we need are missing from the common taxonomy, we add a -BE file with the supplementary categories. In some cases, this could be merged into the common ontology. BTW The common taxonomy directory in the repository does not seem to contain all necessary files. I took the CZ file in these cases.

In ParlaMint-taxonomy-parla.legislature.xml, we would need categories parla.community, parla.federal, and parla.meeting.committee. These are defined in ParlaMint-BE-taxonomy-parla.legislature.xml
The common taxonomy for speaker types (taken from CZ) is quite limited. Hence we added ParlaMint-BE-taxonomy-speaker_types.xml
The Dutch and French UD tagging contains relations not in the common UD taxonomy (mostly from French GSD). They are declared in ParlaMint-BE-taxonomy-UD-SYN.ana.xml. This could be added to the common ParlaMint-taxonomy-UD-SYN.ana.xml?
- The extra relations (except for goeswith and dislocated) are extensions of existing ones. Maybe an encoding like <link target="#ParlaMint-BE_2022-07-07-voorlopig-55-plenair-ip193x.w1061 #ParlaMint-BE_2022-07-07-voorlopig-55-plenair-ip193x.w1065" ana="ud-syn:obl_mod"/> could become something like <link target="#ParlaMint-BE_2022-07-07-voorlopig-55-plenair-ip193x.w1061 #ParlaMint-BE_2022-07-07-voorlopig-55-plenair-ip193x.w1065" ana="ud-syn:obl ud-syn-gsd:obl_mod"/> to distinguish common base layer and extension to make the encoding more interoperable?

matyaskopp commented 1 year ago

missing text: this has to do with text paragraphs which could not automatically be classified in the first step of the conversion from HTML to TEI. In the first stage, these paragraphs are kept as . The ParlaMint scheme does not allow this, so they were just removed in the last cleaning step to make the files ParlaMint-conformant. In the current version, we keep this content as gaps:
 <gap reason="editorial">
 <desc>Content could not be classified: 02 Question de Marianne Verhaert à Mathieu Michel (Digitalisation, Simplification administrative, Protection de la vie privée et Régie des Bâtiments) sur &quot;Le portefeuille électronique&quot; (55024726C)</desc>
 </gap>
I am not satisfied with this encoding, but I do not see a satisfactory alternative without altering the ParlaMint scheme.

Does this mean that you are unsure whether it is an utterance or a stenographer's note <note>? I believe this should be a note if you are not sure.

Using common taxonomies.

We have tried to do this as much as possible now. When categories we need are missing from the common taxonomy, we add a -BE file with the supplementary categories. In some cases, this could be merged into the common ontology. BTW The common taxonomy directory in the repository does not seem to contain all necessary files. I took the CZ file in these cases.

Yes, taking the CZ taxonomy is OK. But for the UD-SYN taxonomy, it is better to use this one: https://github.com/clarin-eric/ParlaMint/blob/main/Data/Taxonomies/ParlaMint-taxonomy-UD-SYN.ana.xml This taxonomy is automatically generated from the UD documentation and contains all documented relations (even for languages that are not in ParlaMint): https://github.com/clarin-eric/ParlaMint/blob/5deaeed5ae792f3ba1726072298885b5b64a6d64/Makefile#L589-L594

In ParlaMint-taxonomy-parla.legislature.xml, we would need categories parla.community, parla.federal, and parla.meeting.committee. These are defined in ParlaMint-BE-taxonomy-parla.legislature.xml

parla taxonomy is meant for parliament plenary speech transcription classification. If you are encoding committee meetings, then you can use your own taxonomy. @TomazErjavec, do we agree on that?

As for the parla.federal category, I think that it should be parla.national and separate parliaments in federation should be parla.regional, so you don't need a parla.federal category.

The common taxonomy for speaker types (taken from CZ) is quite limited. Hence we added ParlaMint-BE-taxonomy-speaker_types.xml

Yes, the taxonomy is limited, but it is as it is defined in ParlaMint.

regular = members of parliament and government
chair = chair of the meeting/sitting
guest = anyone else

If you want to extend this taxonomy, I guess you should create a new one, as you did. But if a minister is speaking, then you should use both taxonomies. (I hope this will not break @TomazErjavec's script):

<u ana="#regular #minister" ...>

But remember that this categorization is speaker categorization, so if someone holds a minister position, it does not necessarily mean that he is speaking as a minister (not a regular MP) - in CZ, we are not able to distinguish this from the transcription.

The Dutch and French UD tagging contains relations not in the common UD taxonomy (mostly from French GSD). They are declared in ParlaMint-BE-taxonomy-UD-SYN.ana.xml. This could be added to the common ParlaMint-taxonomy-UD-SYN.ana.xml?

The extra relations (except for goeswith and dislocated) are extensions of existing ones. Maybe an encoding like <link target="#ParlaMint-BE_2022-07-07-voorlopig-55-plenair-ip193x.w1061 #ParlaMint-BE_2022-07-07-voorlopig-55-plenair-ip193x.w1065" ana="ud-syn:obl_mod"/> could become something like <link target="#ParlaMint-BE_2022-07-07-voorlopig-55-plenair-ip193x.w1061 #ParlaMint-BE_2022-07-07-voorlopig-55-plenair-ip193x.w1065" ana="ud-syn:obl ud-syn-gsd:obl_mod"/> to distinguish common base layer and extension to make the encoding more interoperable?

The https://github.com/clarin-eric/ParlaMint/blob/main/Data/Taxonomies/ParlaMint-taxonomy-UD-SYN.ana.xml taxonomy covers these situations: https://github.com/clarin-eric/ParlaMint/blob/5deaeed5ae792f3ba1726072298885b5b64a6d64/Data/Taxonomies/ParlaMint-taxonomy-UD-SYN.ana.xml#L1210-L1213

If not, it is one of:
- missing documentation, which should be reported here: https://github.com/UniversalDependencies/docs/issues
- a bug in the annotation tool (or its training data) that produces a relation that does not exist; such a relation should be replaced with the universal relation dep

TomazErjavec commented 1 year ago

I believe this should be a note if you are not sure.

I agree, <note type="editorial"> is better than <gap reason="editorial">.

parla taxonomy is meant for parliament plenary speech transcription classification. If you are encoding committee meetings, then you can use your own taxonomy. @TomazErjavec, do we agree on that?

I don't think so. If we are talking about ParlaMint-taxonomy-parla.legislature(.xml), that one contains much more than just plenary speech transcription classification. I would be in favour of adding the BE categories to the common taxonomy, as long as they are nicely positioned in it. I haven't had a look at your taxonomy yet, though. @matyaskopp, do you see a problem here?

matyaskopp commented 1 year ago

I don't think so. If we are talking about ParlaMint-taxonomy-parla.legislature(.xml), that one contains much more than just plenary speech transcription classification. I would be in favour of adding the BE categories to the common taxonomy, as long as they are nicely positioned in it. I haven't had a look at your taxonomy yet, though. @matyaskopp, do you see a problem here?

I can imagine that we can extend parla.legislature taxonomy. But in fact, I think new BE categories are breaking it a bit. The taxonomy describes multiple points of view:

`parla.meeting.committee`

So a category parla.meeting.committee would be a mixture of the temporal and the organization-type points of view. It should instead be expressed with two categories: parla.meeting (temporal) and parla.committee (organization type).

I do not see a reason for adding parla.meeting.committee because it is a kind of hybrid category.

`parla.community`

We don't have "Flemish or Wallonian community" in CZ. What is this category for?

`parla.federal`

I think this can be replaced with parla.national, or we can add this category between parla.supranational and parla.national. I think it is more like a "province" point of view (not an organization type).

JessedeDoes commented 1 year ago

Multiple speaker types indeed break the validation:

 Error: Type error on line 332 column 49 of parlamint-lib.xsl:
    XTTE0780  A sequence of more than one item is not allowed as the result of a call to
    et:u-role#1 ("Prime Minister", "Regular")

We interpreted 'regular' as "speaking as a member of parliament". If a person holds a minister post at the time of speaking, he/she is not speaking as a member of parliament.

But I can map our current speaker types to parlamint, assuming that parliament members, ministers, prime ministers, secretaries of states are "regulars", and the rest are "guest" (incidental speakers)?

JessedeDoes commented 1 year ago

Summarizing:

Gap becomes note for the unclassified paragraphs. Some way to characterize this content would be welcome. Maybe allow subtype="problematic_content" or something along those lines?
We removed some unnecessary information from the taxonomies
Indeed the other UD relation declaration file contains all we need
The samples now pass the github validation

matyaskopp commented 1 year ago

But I can map our current speaker types to parlamint, assuming that parliament members, ministers, prime ministers, secretaries of states are "regulars", and the rest are "guest" (incidental speakers)?

Yes, that is probably best. We need all corpora to be comparable... Thanks

TomazErjavec commented 1 year ago

I agree with @matyaskopp: all speakers are regular speakers (MPs, ministers, the prime minister), except invited guests, who are not affiliated with the parliament or government. Adding "#minister" would be redundant anyway: we know somebody is a minister from their affiliation, by checking the affiliation's from and to dates against the time the person is speaking.
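The agreed mapping could look something like the sketch below. The BE labels are partly assumptions: only "Prime Minister" and "Regular" appear in the error message earlier in this thread, while "Minister", "Secretary of State" and "Chair" are illustrative names, and `parlamint_role` is a hypothetical helper.

```python
# BE speaker types that ParlaMint treats as regular speakers.
# Only "Regular" and "Prime Minister" are attested in this thread;
# the other labels are hypothetical.
REGULAR_TYPES = {"Regular", "Minister", "Prime Minister", "Secretary of State"}


def parlamint_role(be_type: str) -> str:
    """Map one BE speaker type to exactly one ParlaMint @ana value."""
    if be_type == "Chair":
        return "#chair"
    if be_type in REGULAR_TYPES:
        return "#regular"
    # Everybody else is an incidental speaker.
    return "#guest"


print(parlamint_role("Prime Minister"))  # #regular
print(parlamint_role("Expert"))          # #guest
```

Returning a single value per utterance avoids the "sequence of more than one item" validation error that two values in @ana caused.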

Gap becomes note for the unclassified paragraphs. Some way to characterize this content would be welcome. Maybe allow subtype="problematic_content" or something along those lines?

I think @type="editorial" covers this anyway to an extent (why would the editor put something into a note unless it was problematic). It would not be much work to add @subtype to note, but I am a bit disinclined to do so, if only BE would be using it (while others have similar cases, which they treat as note/@type="editorial").

As for the taxonomy, I would need to find some quality time to understand the whole thing, which I can't seem to find, sigh. Maybe the weekend...

matyaskopp commented 1 year ago

invalid url format

[x] fix urls

https://github.com/JessedeDoes/ParlaMint/blob/32213d529bbbb2b28ced35d2a7bfb74c2ba9edd1/Data/ParlaMint-BE/ParlaMint-BE.xml#L70-L78

                    <idno type="URI">https://www.dekamer.be/kvvcr/showpage.cfm?section=/cricra
          &amp;
          language=nl
          &amp;
          cfm=dcricra.cfm?type=plen
          &amp;
          cricra=cri
          &amp;
          count=all</idno>
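Since a valid URL contains no literal whitespace, the reformatted value can be repaired by dropping all whitespace from the element text. A minimal sketch, assuming the value is read with the Python standard library (`clean_idno_url` is a hypothetical helper):

```python
import xml.etree.ElementTree as ET


def clean_idno_url(idno: ET.Element) -> str:
    """Remove all whitespace that a reformatter introduced inside a URL."""
    idno.text = "".join((idno.text or "").split())
    return idno.text


idno = ET.fromstring(
    """<idno type="URI">https://www.dekamer.be/kvvcr/showpage.cfm?section=/cricra
          &amp;
          language=nl
          &amp;
          cfm=dcricra.cfm?type=plen
          &amp;
          cricra=cri
          &amp;
          count=all</idno>"""
)
print(clean_idno_url(idno))
```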

matyaskopp commented 1 year ago

speeches misclassification

[ ] speeches misclassification

I still don't understand why so many speeches are misclassified. From my point of view (without knowledge of the language), the HTML classes, elements and other attributes can be used to classify them.

Describing this: https://github.com/JessedeDoes/ParlaMint/blob/32213d529bbbb2b28ced35d2a7bfb74c2ba9edd1/Data/ParlaMint-BE/ParlaMint-BE_2021-03-30-definitief-55-commissie-ic427x.xml#L174-L177 which corresponds to this place in the source: https://www.dekamer.be/doc/CCRI/html/55/ic427x.html#TN01

regular/guest speeches start with  with one of these classes italFR, NormalNL, NormalFR (and probably italNL). Inside these , there are:

(optionally) <a name="TN01"></a> where TN01 is speech number (you can use this anchor in @source attribute - see CZ  elements)
one or two ... which contain the number of the speech ({topic}.{speech in topic}) and the speaker name
this is followed by (party): in the following span

It looks like a speech ends when:
a new speech starts
or the topic changes (https://www.dekamer.be/doc/CCRI/html/55/ic427x.html#T016), which is marked by this sequence of elements:
```
<a name=T016></a>Het incident is gesloten.
L'incident est clos.
...
...
```
So only the beginning of the meeting and the start of a new topic, before its first speech, can contain unclassified notes, or you can classify them as <note type="comment">...</note>

There are also chairman speeches that do not follow the rules above, but you have correctly identified them.
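The speaker line described above (speech number as {topic}.{speech in topic}, then the speaker name, then (party):) can be recognised with a regular expression once the text of the spans has been joined. This is a sketch only: the pattern and the name `parse_speaker_line` are assumptions, checked against the two speaker notes quoted earlier in this issue, not against the whole corpus.

```python
import re
from typing import Optional

# {topic}.{speech in topic} <speaker name> (<party>):
SPEAKER_LINE = re.compile(
    r"^(?P<topic>\d+)\.(?P<speech>\d+)\s+(?P<name>.+?)\s+\((?P<party>[^()]+)\):\s*$"
)


def parse_speaker_line(line: str) -> Optional[dict]:
    """Split a speaker line into its parts, or return None if it does not match."""
    match = SPEAKER_LINE.match(line.strip())
    return match.groupdict() if match else None


print(parse_speaker_line("01.01 Nabil Boukili (PVDA-PTB):"))
print(parse_speaker_line("Het incident is gesloten."))  # None
```

A paragraph whose text does not match could then be treated as a candidate note rather than the start of a speech.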

notes do not contain `xml:lang`

[ ] xml:lang in <note>
this is available in source HTML

strange xml directives

[ ] remove directives inside xml document <? ?>

If you want to use <note type="editorial">, then please remove the <? ?> processing instruction, which is strange here: https://github.com/JessedeDoes/ParlaMint/blob/32213d529bbbb2b28ced35d2a7bfb74c2ba9edd1/Data/ParlaMint-BE/ParlaMint-BE_2021-06-03-definitief-55-plenair-ip107x.xml#L401-L403

                <note type="editorial">
                    <?uncertain_content_classification Could be possibly be speaker text?>
                    L'incident est clos.
                </note>

But I would prefer not to use it, at least in the case above, which can be encoded better:

<note type="comment" xml:lang="fr">L'incident est clos.</note>

or:

<note type="narrative" xml:lang="fr">L'incident est clos.</note>
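Removing the processing instructions can be done on the serialized text before any further processing. The following is a rough sketch, not the converter's actual code: `strip_processing_instructions` is a hypothetical helper, and the pattern deliberately keeps only the XML declaration, so other instructions such as `<?xml-stylesheet ...?>` would be removed as well.

```python
import re
import xml.etree.ElementTree as ET

# Any processing instruction except the XML declaration itself.
PI_PATTERN = re.compile(r"<\?(?!xml[\s?]).*?\?>", re.DOTALL)


def strip_processing_instructions(xml_text: str) -> str:
    """Delete processing instructions such as <?uncertain_content_classification ...?>."""
    return PI_PATTERN.sub("", xml_text)


sample = """<note type="editorial">
    <?uncertain_content_classification Could be possibly be speaker text?>
    L'incident est clos.
</note>"""

note = ET.fromstring(strip_processing_instructions(sample))
print(" ".join(note.text.split()))  # L'incident est clos.
```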

JessedeDoes commented 1 year ago

Fixing the URL (a strange effect of automatic script reformatting in IntelliJ) will be easy
The xml:lang was present on the  elements, so echoing it on the notes is not a problem
The idea with the processing instruction was to mark these cases as a todo for further processing.
Yes, surely the classification of content can be improved. Currently, we do not have any developer with time to work on this refinement; we would prefer to postpone this to a later stage when we will revisit the whole pipeline in order to minimize the amount of manual supervision, so it can run continuously on new available data instead of the current bursty approach

TomazErjavec commented 1 year ago

@JessedeDoes, in 77e8d95 I've added parla.meeting.committee to the general taxonomy. I'm not absolutely sure the category belongs where I put it, but it might be good enough for now. So, could you copy the new category into your general ParlaMint-taxonomy-parla.legislature taxonomy and remove your additional taxonomy, please?

https://github.com/clarin-eric/ParlaMint/blob/77e8d952526ebcaf73f075c7a64b72071ae41be3/Data/Taxonomies/ParlaMint-taxonomy-parla.legislature.xml#L225-L233

clarin-eric / ParlaMint

BE feedback #496
