clarin-eric / ParlaMint

ParlaMint: Comparable Parliamentary Corpora
https://clarin-eric.github.io/ParlaMint/

PT: feedback #332

Closed matyaskopp closed 1 year ago

matyaskopp commented 2 years ago

This issue summarizes the changes needed to improve the metadata in the PT corpus (commit 94489ada469c2ea1f548c911f65fcca85c680ec4).

First, note that a valid corpus should pass all validations (unless there is a bug in the validation script). The converse does not hold: passing validation doesn't necessarily mean that the corpus is valid.

ParlaMint structure and metadata

person name

You use the term element in persName:

java -cp /usr/share/java/saxon.jar net.sf.saxon.Query -xi:off \!method=adaptive \
            -qs:'//*[local-name()="persName"]/*/local-name()' \
            -s:Data/ParlaMint-PT/ParlaMint-PT.xml \
            | sort | uniq -c
    723 "term"

The term element is meant to be used only if every other option fails: https://clarin-eric.github.io/ParlaMint/#TEI.persName

Note Special persons (like 'anonymous', 'group' etc.) have their name in <term>.

So the correct encoding of a person's name in the listPerson/person context is to use forename and surname.
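For anyone without Saxon at hand, the same count can be approximated with the Python standard library. This is only a sketch: `count_persname_children` and the inline `SAMPLE` are illustrative names and data, not part of the ParlaMint tooling.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def count_persname_children(root: ET.Element) -> Counter:
    """Count the element names used directly inside every persName."""
    counts = Counter()
    for elem in root.iter():
        if local_name(elem.tag) == "persName":
            counts.update(local_name(child.tag) for child in elem)
    return counts


# Illustrative sample: one name in <term>, one split into forename/surname.
SAMPLE = """<listPerson xmlns="http://www.tei-c.org/ns/1.0">
  <person xml:id="a"><persName><term>António André da Silva Topa</term></persName></person>
  <person xml:id="b"><persName><forename>Alexandre</forename><surname>Brito</surname></persName></person>
</listPerson>"""

counts = count_persname_children(ET.fromstring(SAMPLE))
print(dict(counts))
```

On the real corpus root file one would pass `ET.parse(path).getroot()` instead of the sample; a non-zero count for `term` points at names that still need splitting.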

affiliation roles

java -cp /usr/share/java/saxon.jar net.sf.saxon.Query -xi:off \!method=adaptive \
            -qs:'//*[local-name()="affiliation"]/@role' \
            -s:Data/ParlaMint-PT/ParlaMint-PT.xml \
            | sort | uniq -c
   2386 role="member"

You use only the member role in affiliations; there is no minister or head role. I guess you have at least the minister role information, because there are affiliations with the government. Remember that anyone affiliated with the minister role should also be affiliated with the member role for the same period: https://clarin-eric.github.io/ParlaMint/#sec-affiliation

An important point to note is that ParlaMint makes no assumptions on the interconnection between various roles, e.g. we do not assume that if somebody has a minister role in the government that they are also a member of the government. Therefore it is necessary to specify all the desired affiliations with their particular roles, e.g. both as minister and as member.
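A rough way to check this rule could look like the sketch below. It assumes that "same period" means identical ref, from and to values, which is stricter than a real interval comparison; the function name, the sample person and its dates are all hypothetical, and the TEI namespace is omitted for brevity.

```python
import xml.etree.ElementTree as ET


def missing_member_affiliations(person: ET.Element) -> list:
    """Return (ref, from, to) for each non-member affiliation without a member twin."""
    affs = list(person.iter("affiliation"))
    member_keys = {
        (a.get("ref"), a.get("from"), a.get("to"))
        for a in affs
        if a.get("role") == "member"
    }
    missing = []
    for a in affs:
        key = (a.get("ref"), a.get("from"), a.get("to"))
        if a.get("role") != "member" and key not in member_keys:
            missing.append(key)
    return missing


# Hypothetical person: minister in the government, but member only of parliament.
SAMPLE = """<person xml:id="ExamplePerson">
  <affiliation role="minister" ref="#GOV" from="2015-11-26" to="2019-10-26"/>
  <affiliation role="member" ref="#AR" from="2015-10-23" to="2019-10-24"/>
</person>"""

missing = missing_member_affiliations(ET.fromstring(SAMPLE))
print(missing)
```

Each tuple reported is an affiliation for which a matching `role="member"` entry would still have to be added.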

taxonomies

There are taxonomies that I don't understand, e.g.: https://github.com/cluljoseaires/ParlaMint/blob/94489ada469c2ea1f548c911f65fcca85c680ec4/Data/ParlaMint-PT/ParlaMint-PT.ana.xml#L127-L145

                <taxonomy xml:id="parla.legislature">
                    <desc xml:lang="en">
                        <term>Legislature</term>
                    </desc>
                    <category xml:id="XII">
                        <catDesc>
                            <term />
                        </catDesc>
                    </category>
                    <category xml:id="XIII">
                        <catDesc>
                            <term />
                        </catDesc>
                    </category>
                    <category xml:id="XIV">
                        <catDesc>
                            <term />
                        </catDesc>
                    </category>

You are supposed to take the common taxonomies and translate them into the corpus language. You can start with the taxonomies from the SI corpus: https://github.com/clarin-eric/ParlaMint/blob/5e4fdd6d638bd22bbf121ffcb605cd76ebf952b4/Data/ParlaMint-SI/ParlaMint-SI.ana.xml#L117-L125

We will then merge (on @xml:id attribute) these translations into one common taxonomy (https://github.com/clarin-eric/ParlaMint/issues/264)
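To make the idea of merging on @xml:id concrete, here is a minimal sketch. It is not the actual ParlaMint merge script: `merge_taxonomy` is a hypothetical helper, the samples omit the TEI namespace, and categories unknown to the common taxonomy are simply skipped.

```python
import xml.etree.ElementTree as ET

XML_NS = "http://www.w3.org/XML/1998/namespace"
XML_ID = f"{{{XML_NS}}}id"
XML_LANG = f"{{{XML_NS}}}lang"


def merge_taxonomy(common: ET.Element, translated: ET.Element) -> ET.Element:
    """Add catDesc translations to the category with the same xml:id in `common`."""
    by_id = {cat.get(XML_ID): cat for cat in common.iter("category")}
    for cat in translated.iter("category"):
        target = by_id.get(cat.get(XML_ID))
        if target is None:
            continue  # unknown category: ignored in this sketch
        present = {d.get(XML_LANG) for d in target.findall("catDesc")}
        for desc in cat.findall("catDesc"):
            if desc.get(XML_LANG) not in present:
                target.append(desc)
    return common


COMMON = """<taxonomy xml:id="speaker_types">
  <category xml:id="chair">
    <catDesc xml:lang="en"><term>Chairperson</term>: chairman of a sitting</catDesc>
  </category>
</taxonomy>"""

CZECH = """<taxonomy xml:id="speaker_types">
  <category xml:id="chair">
    <catDesc xml:lang="cs"><term>Předsedající</term>: předsedá zasedání</catDesc>
  </category>
</taxonomy>"""

merged = merge_taxonomy(ET.fromstring(COMMON), ET.fromstring(CZECH))
langs = [d.get(XML_LANG) for d in merged.iter("catDesc")]
print(langs)
```

The merge only works if every corpus keeps the shared category identifiers unchanged, which is why the translations should start from an existing taxonomy rather than from newly invented ids.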

meeting element

meeting elements do not provide enough information. See examples: https://clarin-eric.github.io/ParlaMint/#TEI.meeting Component files also need fixing.

setting (and also bibl) in component files

setting in a component file should describe the content of the component file, not the whole corpus: https://github.com/cluljoseaires/ParlaMint/blob/94489ada469c2ea1f548c911f65fcca85c680ec4/Data/ParlaMint-PT/ParlaMint-PT_darl12sl04n034-07-01-2015.xml#L68

For the file ParlaMint-PT_darl12sl04n034-07-01-2015.xml it should be:

                <setting>
                    <name type="address">Alameda da Universidade, 1600-214 Lisboa, Portugal</name>
                    <name type="city">Lisboa</name>
                    <name type="country" key="PT">Portugal</name>
                    <!--BUG: <date from="2015-11-06" to="2020-08-18">06.11.2015 - 18.08.2020</date> -->
                    <date when="2015-01-07">07.01.2015</date>
                </setting>

strangeness in data (my observations)

corpus timespan (end)

I believe that at least settingDesc should correspond to the corpus timespan: (https://tei-c.org/release/doc/tei-p5-doc/en/html/ref-settingDesc.html)

<settingDesc> (setting description) describes the setting or settings within which a language interaction takes place, or other places otherwise referred to in a text, edition, or metadata. [15.2 Contextual Information 2.4 The Profile Description]

affiliation of dead person

https://github.com/cluljoseaires/ParlaMint/blob/94489ada469c2ea1f548c911f65fcca85c680ec4/Data/ParlaMint-PT/ParlaMint-PT.ana.xml#L1109-L1120

                    <person xml:id="AntónioAndrédaSilvaTopa">
                        <persName>
                            <term>António André da Silva Topa</term>
                        </persName>
                        <sex value="M" />
                        <birth when="1954-09-02" />
                        <death when="2021-10-31" />
                        <affiliation role="member" ref="#PSD" />
                        <affiliation role="member" ref="#GrupoParlamentardoPsd" />
                        <affiliation role="member" ref="#AR" from="2015-10-23" to="2019-10-24" />
                        <affiliation role="member" ref="#AR" from="2019-10-25" to="2022-03-28" />
                    </person>
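
Cases like this one can be listed automatically. The sketch below is an assumption about how such a check might look, not an existing ParlaMint validation step; it compares each affiliation's @to with the death date, ignores open-ended affiliations, and omits the TEI namespace.

```python
import xml.etree.ElementTree as ET
from datetime import date


def affiliations_after_death(person: ET.Element) -> list:
    """Return (ref, to) for affiliations that end after the person's death."""
    death = person.find("death")
    if death is None or death.get("when") is None:
        return []
    died = date.fromisoformat(death.get("when"))
    late = []
    for aff in person.findall("affiliation"):
        end = aff.get("to")
        if end is not None and date.fromisoformat(end) > died:
            late.append((aff.get("ref"), end))
    return late


# The person element quoted above, without the TEI namespace.
SAMPLE = """<person xml:id="AntónioAndrédaSilvaTopa">
  <persName><term>António André da Silva Topa</term></persName>
  <sex value="M" />
  <birth when="1954-09-02" />
  <death when="2021-10-31" />
  <affiliation role="member" ref="#PSD" />
  <affiliation role="member" ref="#GrupoParlamentardoPsd" />
  <affiliation role="member" ref="#AR" from="2015-10-23" to="2019-10-24" />
  <affiliation role="member" ref="#AR" from="2019-10-25" to="2022-03-28" />
</person>"""

late = affiliations_after_death(ET.fromstring(SAMPLE))
print(late)
```

Affiliations without @to (here the party and the parliamentary group) are not reported by this sketch, although they arguably should be closed at the date of death as well.
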
amamendes commented 2 years ago

Thanks Matyás, we're going to look into it.

cluljoseaires commented 2 years ago

Thank you so much for your tips. I already suspected the validation process would not be enough to have a finished product, but at least we have that out of the way.

Considering your observations, I believe I can fix the issues relatively easily since I already have the data in the cases where it is missing. I did not include it simply because I forgot after so many iterations to make the tests iteratively pass in this overwhelming task.

Once I have taken care of the issues, I will do another pull request.

matyaskopp commented 2 years ago

Considering your observations, I believe I can fix the issues relatively easily since I already have the data in the cases where it is missing. I did not include it simply because I forgot after so many iterations to make the tests iteratively pass in this overwhelming task.

I am glad to hear that it wouldn't be a big issue.

I have one more observation:

bibl link

It would be great to have a direct link to transcription in component files. You have one common link for all files, but it is better to use the real source of data.

You can check CZ data, where I am preserving a lot of link types

matyaskopp commented 2 years ago

Nice work, I am adding a few more observations

replace Presidency organization + role

discussed here: https://github.com/clarin-eric/ParlaMint/issues/305 https://github.com/cluljoseaires/ParlaMint/blob/369dc01e5b65b5a1719e487732b17b32512140ac/Data/ParlaMint-PT/ParlaMint-PT.xml#L185-L196

                    <org xml:id="Presidency" role="institution">
                        <orgName xml:lang="pt" full="yes">Presidência</orgName>
                        <orgName xml:lang="pt" full="yes">Presidência da República Portuguesa</orgName>
                        <listEvent>
                            <event xml:id="P.XIX" from="2006-03-09" to="2016-03-08">
                                <label xml:lang="pt">XIX Presidente</label>
                            </event>
                            <event xml:id="P.XX" from="2016-03-08" to="2026-03-09">
                                <label xml:lang="pt">XX Presidente</label>
                            </event>
                        </listEvent>
                    </org>

Two full organization names in the same language

https://github.com/cluljoseaires/ParlaMint/blob/369dc01e5b65b5a1719e487732b17b32512140ac/Data/ParlaMint-PT/ParlaMint-PT.xml#L198-L199

                    <org xml:id="Parliament" role="parliament" ana="#parla.national #parla.uni">
                        <orgName xml:lang="pt" full="yes">Assembleia da República</orgName>
                        <orgName xml:lang="pt" full="yes">Assembleia da República Portuguesa</orgName>

surname instead of nameLink

I think this type of name can be encoded with different elements: https://github.com/cluljoseaires/ParlaMint/blob/369dc01e5b65b5a1719e487732b17b32512140ac/Data/ParlaMint-PT/ParlaMint-PT.xml#L583-L593

                        <persName>
                            <forename>Alexandre</forename>
                            <surname>Nuno</surname>
                            <surname>Vaz</surname>
                            <surname>Batista</surname>
                            <surname>de</surname>
                            <surname>Vieira</surname>
                            <surname>e</surname>
                            <surname>Brito</surname>
                        </persName>
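
A possible clean-up pass, sketched under a loud assumption: the set of Portuguese connecting particles below is my own guess and may be incomplete, and `retag_particles` is a hypothetical helper. It only re-tags particles as nameLink; deciding whether "Nuno" is a second forename still needs a human.

```python
import xml.etree.ElementTree as ET

# Assumption: particles that link name parts rather than being surnames themselves.
PARTICLES = {"de", "da", "do", "dos", "das", "e"}


def retag_particles(pers_name: ET.Element) -> int:
    """Turn <surname> children that are mere particles into <nameLink>."""
    changed = 0
    for child in pers_name:
        if child.tag == "surname" and (child.text or "").strip().lower() in PARTICLES:
            child.tag = "nameLink"
            changed += 1
    return changed


SAMPLE = """<persName>
  <forename>Alexandre</forename>
  <surname>Nuno</surname>
  <surname>Vaz</surname>
  <surname>Batista</surname>
  <surname>de</surname>
  <surname>Vieira</surname>
  <surname>e</surname>
  <surname>Brito</surname>
</persName>"""

pers_name = ET.fromstring(SAMPLE)
changed = retag_particles(pers_name)
print(changed, [child.tag for child in pers_name])
```
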

strange taxonomies are still present

one example: https://github.com/cluljoseaires/ParlaMint/blob/369dc01e5b65b5a1719e487732b17b32512140ac/Data/ParlaMint-PT/ParlaMint-PT.xml#L84-L103

                <taxonomy xml:id="parla.speakers">
                    <desc xml:lang="en">
                        <term>Types of speakers</term>
                    </desc>
                    <category xml:id="chair">
                        <catDesc>
                            <term/>
                        </catDesc>
                    </category>
                    <category xml:id="regular">
                        <catDesc>
                            <term/>
                        </catDesc>
                    </category>
                    <category xml:id="guest">
                        <catDesc>
                            <term/>
                        </catDesc>
                    </category>
                </taxonomy>

Correct in CZ corpus

            <taxonomy xml:id="speaker_types">
               <desc xml:lang="cs">
                  <term>Druhy řečníků</term>
               </desc>
               <desc xml:lang="en">
                  <term>Types of speakers</term>
               </desc>
               <category xml:id="chair">
                  <catDesc xml:lang="cs">
                     <term>Předsedající</term>: předsedá zasedání</catDesc>
                  <catDesc xml:lang="en">
                     <term>Chairperson</term>: chairman of a sitting</catDesc>
               </category>
               <category xml:id="regular">
                  <catDesc xml:lang="cs">
                     <term>Poslanec</term>: poslanec nebo člen vlády</catDesc>
                  <catDesc xml:lang="en">
                     <term>Regular</term>: a regular speaker at a sitting</catDesc>
               </category>
               <category xml:id="guest">
                  <catDesc xml:lang="cs">
                     <term>Host</term>: ghostující řečník na sezení</catDesc>
                  <catDesc xml:lang="en">
                     <term>Guest</term>: a guest speaker at a sitting</catDesc>
               </category>
            </taxonomy>
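
Empty terms like the ones in the PT taxonomy pass schema validation, so they have to be searched for separately. A minimal sketch, assuming the taxonomy is parsed without the TEI namespace; `empty_term_categories` is a hypothetical name, not an existing script.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def empty_term_categories(root: ET.Element) -> list:
    """Return the xml:id of every category whose term is missing or empty."""
    empty = []
    for cat in root.iter("category"):
        terms = cat.findall("catDesc/term")
        if not terms or any(not (t.text or "").strip() for t in terms):
            empty.append(cat.get(XML_ID))
    return empty


SAMPLE = """<taxonomy xml:id="parla.speakers">
  <category xml:id="chair">
    <catDesc><term/></catDesc>
  </category>
  <category xml:id="regular">
    <catDesc xml:lang="en"><term>Regular</term>: a regular speaker at a sitting</catDesc>
  </category>
</taxonomy>"""

empty = empty_term_categories(ET.fromstring(SAMPLE))
print(empty)
```
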
amamendes commented 2 years ago

Thanks for the feedback.

We have another question about the Portuguese data. During the tokenization, we decided to keep the frequent contractions in Portuguese and not to expand them. This is the same decision we took for the Reference Corpus of Contemporary Portuguese, and the reason is that expanding the contractions makes it difficult for users to read the text. To deal with this, we use composite POS tags. For instance, the contraction of a preposition+determiner (e.g., "no") is kept as such and is labelled with the tag "ADP+DET". We see that this is not accepted during validation. Could composite tags be integrated in ParlaMint?
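
To illustrate what accepting composite tags would mean in practice, here is a sketch of a tag check that splits on "+" and validates each part against the 17 Universal Dependencies UPOS tags. It is not the ParlaMint validator, and `is_valid_pos` is a hypothetical name.

```python
# The 17 universal POS tags defined by Universal Dependencies.
UPOS = frozenset({
    "ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
    "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X",
})


def is_valid_pos(tag: str, allow_composite: bool = True) -> bool:
    """Accept a single UPOS tag, or several joined by '+' when composites are allowed."""
    parts = tag.split("+") if allow_composite else [tag]
    return all(part in UPOS for part in parts)


print(is_valid_pos("ADP+DET"))
print(is_valid_pos("ADP+DET", allow_composite=False))
```

With `allow_composite=False` the check behaves like a strict single-tag validation and rejects "ADP+DET", which matches the behaviour we currently see.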

matyaskopp commented 2 years ago

missing application description

https://github.com/cluljoseaires/ParlaMint/blob/369dc01e5b65b5a1719e487732b17b32512140ac/Data/ParlaMint-PT/ParlaMint-PT.ana.xml#L390-L395

            <appInfo>
                <application ident="id" version="0.1">
                    <label/>
                    <desc/>
                </application>
            </appInfo>
cluljoseaires commented 2 years ago

Neither 'republic' nor 'president' are allowed as 'presidency org role', which is why I chose 'institution', before your observation.

error: value of attribute "role" is invalid; must be equal to "boardOfDirectors", "boardOfParliament", "chamberOfTheNations", "chamberOfThePeople", "coalition", "commission", "committee", "conferenceOfChairs", "delegation", "ethnicCommunity", "europeanCommission", "europeanInstitution", "europeanParliament", "government", "institution", "internationalOrganisation", "interparliamentaryFriendshipGroup", "nationalCouncil", "ngo", "parliament", "parliamentaryGroup", "politicalParty", "senate", "subcommittee", "supervisoryBoard" or "workingGroup"

By the way, I did a pull just to make sure there were no missing updates on my side.

matyaskopp commented 2 years ago

By the way, I did a pull just to make sure there were no missing updates on my side.

My fault, it was in the documentation branch, which is now synced with the data branch: https://github.com/clarin-eric/ParlaMint/pull/356

cluljoseaires commented 2 years ago

About the political orientation:

<org xml:id="PSD" role="politicalParty">
    <orgName xml:lang="pt" full="yes">Partido Social Democrata</orgName>
    <orgName xml:lang="pt" full="abb">PSD</orgName>
    <event from="1974-04-25">
        <label xml:lang="en">existence</label>
    </event>
    <state type="politicalOrientation" subtype="unknown" ana="#orientation.CR">
        <note xml:lang="en">Orientation determined by encoder, using own knowledge of the parliamentary group.</note>
    </state>
</org>

I get the following:

error: element "state" not allowed anywhere; expected the element end-tag or element "desc", "event", "idno" or "listEvent"

Perhaps I am missing something...

matyaskopp commented 2 years ago

@TomazErjavec, can you please clarify this https://github.com/clarin-eric/ParlaMint/issues/332#issuecomment-1271491379 ?

cluljoseaires commented 2 years ago

I was trying to include the party political orientation using the first case of https://clarin-eric.github.io/ParlaMint/#sec-parties

<org xml:id="PSD" role="politicalParty">
    <orgName xml:lang="pt" full="yes">Partido Social Democrata</orgName>
    <orgName xml:lang="pt" full="abb">PSD</orgName>
    <event from="1974-04-25"><label xml:lang="en">existence</label></event>
    <state type="politicalOrientation" subtype="unknown" ana="#orientation.CR">
        <note xml:lang="en">Orientation determined by encoder, using own knowledge of the parliamentary group.</note>
    </state>
</org>

but I get the «element "state" not allowed anywhere» error mentioned in my previous post.

matyaskopp commented 2 years ago

your branch is not up to date: https://github.com/cluljoseaires/ParlaMint/pull/1

cluljoseaires commented 2 years ago

Indeed, I needed to merge the main branch into the data branch... Sorry. I now need a taxonomy for the political orientations. Any tips would be welcome. ;)

TomazErjavec commented 2 years ago

@cluljoseaires, it is actually a bit premature to start adding political orientation. Cf. this comment, esp. the G'doc mentioned there. We are doing it via TSV files, which we first produce automatically and which then need to be edited, so we get the CHES orientations for free. So, let's wait until the corpora are submitted; then we can generate the TSVs, have them edited, and automatically add their content to the TEI.

matyaskopp commented 2 years ago
cluljoseaires commented 2 years ago

Perhaps the taxonomies, in particular the ud-syn part, might need additional touches but, other than that, all issues have been addressed.

matyaskopp commented 2 years ago

ok, closing issue

matyaskopp commented 2 years ago

I am sorry, I was too hasty in closing this issue...

Wrong meeting annotations in component files

https://github.com/cluljoseaires/ParlaMint/blob/5e8a2d87fb8a02db9d01da513942234aa226690f/Data/ParlaMint-PT/ParlaMint-PT_2015-01-07.xml#L7

 <meeting ana="#parla.term #L.XII" n="s.4/n.34">Série I - XII Legislatura - Sessão 4 - Número 34</meeting>

@TomazErjavec's suggestion:

<meeting ana="#parla.term #L.XII #parla.uni" n="I/XII">Série I - XII Legislatura</meeting>
<meeting ana="#parla.session #parla.uni" n="4">Sessão 4</meeting>
<meeting ana="#parla.meeting #parla.uni" n="34">Número 34</meeting>

Please also add #parla.sitting if it makes sense (I believe it does). You can use the date for text() and @n.

Component file /TEI/@ana

Please use the correct classification that specifies the content of the file #parla.meeting or #parla.sitting. Greek sample: https://github.com/clarin-eric/ParlaMint/blob/a7c867f644b84269a30a818363fb4ec650ed3fff/Data/ParlaMint-GR/ParlaMint-GR_2015-02-06-S1-commons.xml#L1-L6

component file title

https://clarin-eric.github.io/ParlaMint/#sec-titleStmt

The title statement starts with two titles (one main, the other subordinate), both in English and the local language, with the appropriate language code possibly inherited from a superordinate element. ... ... Both titles must be unique in the complete corpus.

Your titles are not unique: there are only 111 distinct titles for 702 component files.
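
A quick way to see which titles collide is to count them across files. The sketch below assumes the main title of each component file has already been extracted into a mapping; the file names and titles shown are made up for illustration, and `duplicated_titles` is a hypothetical helper.

```python
from collections import Counter


def duplicated_titles(titles_by_file: dict) -> dict:
    """Return every title used by more than one file, with its number of uses."""
    counts = Counter(titles_by_file.values())
    return {title: n for title, n in counts.items() if n > 1}


# Made-up example: two files share a title, one has its own.
titles_by_file = {
    "ParlaMint-PT_2015-01-07.xml": "Example corpus title, Session 4 [ParlaMint]",
    "ParlaMint-PT_2015-01-08.xml": "Example corpus title, Session 4 [ParlaMint]",
    "ParlaMint-PT_2015-01-09.xml": "Example corpus title, Session 4, Number 36 [ParlaMint]",
}

duplicates = duplicated_titles(titles_by_file)
print(duplicates)
```

If the number of distinct titles equals the number of files, the returned dictionary is empty.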

cluljoseaires commented 2 years ago

Thank you for the notes. I am just intrigued by the 702 files because they were supposed to be 704, but above all I am really confused by the 111 titles. Obviously, I need to do some checking...

cluljoseaires commented 2 years ago

As I suspected, the issue with the number of files had to do with the fact that some files had incorrectly repeated dates, which was masked by the remainder elements present in the filename.

cluljoseaires commented 2 years ago

I have addressed all of your notes and I believe the documents are substantially better but tell me if you need improvements.

matyaskopp commented 2 years ago

There is one issue left:

component file title

  • [ ] unique title

@TomazErjavec, can you support my interpretation that both titles (title[@type='main'] and title[@type='sub']) must be unique? By "both titles" I mean main and sub, in both/all languages (lang=en and lang=xx).

In the example it can be seen that the main title of a corpus component is simply an extension of the corpus root title, as it also gives the name of the particular meeting that the component contains, while the subordinate title is, again, free text. Both titles must be unique in the complete corpus.

I believe that this example is misleading: (https://clarin-eric.github.io/ParlaMint/#sec-titleStmt)

<titleStmt>
 <title type="main">Slovenski parlamentarni korpus ParlaMint-SI, izredna seja 59 [ParlaMint]</title>
 <title type="main" xml:lang="en">Slovenian parliamentary corpus ParlaMint-SI, Extraordinary Session 59 [ParlaMint]</title>
 <title type="sub">Zapisi sej Državnega zbora Republike Slovenije, 7. mandat, 59. izredna seja, 13.4.2018</title>
 <title type="sub" xml:lang="en">Minutes of the National Assembly of the Republic of Slovenia, Term 7, Extraordinary Session 59, 13.4.2018</title>

because Slovenski parlamentarni korpus ParlaMint-SI, izredna seja 59 [ParlaMint] doesn't look like a unique title.

cluljoseaires commented 2 years ago

Given that @TomazErjavec has sent me a message saying everything was fine except for some notes having trailing spaces, I assume the example is acceptable.

TomazErjavec commented 1 year ago

@cluljoseaires, I just rebuilt your corpus and it turns out it is no longer valid. Namely (cf. #472) after you submitted your corpus we decided on a change to the schema (and Guidelines) so that div elements that do not contain utterances (but only headings, notes) should get the type "commentSection" rather than "debateSection", and your corpus has a number of such divs.

My finalization script corrects this, so you don't actually have to do anything if you can't be bothered and don't care if your copy of the corpus is different from the one that goes into the repository. But if you would like to keep them in sync, then please correct the types of these divs. The list of the files to be corrected can be found in https://nl.ijs.si/et/tmp/ParlaMint/Repo/ParlaMint-PT.log (grep for "no utterances in div").
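
For a local fix, something along these lines might be enough. This is a sketch and not the finalization script: it assumes the file is parsed without the TEI namespace, and `fix_comment_sections` is a hypothetical name.

```python
import xml.etree.ElementTree as ET


def fix_comment_sections(root: ET.Element) -> int:
    """Retype debateSection divs that contain no <u> as commentSection."""
    fixed = 0
    for div in root.iter("div"):
        if div.get("type") == "debateSection" and div.find(".//u") is None:
            div.set("type", "commentSection")
            fixed += 1
    return fixed


SAMPLE = """<body>
  <div type="debateSection"><head>Heading only</head><note>A note</note></div>
  <div type="debateSection"><u who="#speaker"><seg>A speech.</seg></u></div>
</body>"""

body = ET.fromstring(SAMPLE)
fixed = fix_comment_sections(body)
print(fixed, [div.get("type") for div in body.iter("div")])
```
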

Note also that there are two reported errors regarding bad chars (again, this is a new check):

ERROR: File ParlaMint-PT_2019-12-20.ana.xml contains bad chars: U+F0B7 (4x)
ERROR: File ParlaMint-PT_2020-04-16.ana.xml contains bad chars: U+F020 (2x)

This I don't fix as it might be non-trivial, then again, we can live with 6 bad characters in the corpus.
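
Both reported code points fall in the Unicode Private Use Area (U+E000 to U+F8FF). Assuming that this range is what the check means by "bad chars", which is a guess and not taken from the check itself, they can be located like this; `private_use_chars` is a hypothetical helper.

```python
from collections import Counter


def private_use_chars(text: str) -> Counter:
    """Count characters from the BMP Private Use Area, keyed as 'U+XXXX'."""
    return Counter(
        f"U+{ord(ch):04X}" for ch in text if 0xE000 <= ord(ch) <= 0xF8FF
    )


# Illustrative text containing the first reported character twice.
sample_text = "item \uf0b7 one \uf0b7 two"
found = private_use_chars(sample_text)
print(dict(found))
```

Running this over the text of the two files listed above would show where the six characters sit, so they can be replaced by hand.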

And sorry about this late change & pls. let me know if you will (not) fix this, so I can re-close this issue.