Closed matyaskopp closed 1 year ago
Thanks Matyás, we're going to look into it.
Thank you so much for your tips. I already suspected the validation process would not be enough to have a finished product, but at least we have that out of the way.
Considering your observations, I believe I can fix the issues relatively easily since I already have the data in the cases where it is missing. I did not include it simply because I forgot after so many iterations to make the tests iteratively pass in this overwhelming task.
Once I have taken care of the issues, I will do another pull request.
> Considering your observations, I believe I can fix the issues relatively easily since I already have the data in the cases where it is missing. I did not include it simply because I forgot after so many iterations to make the tests iteratively pass in this overwhelming task.
I am glad to hear that it wouldn't be a big issue.
I have one more observation:
`bibl` link
It would be great to have a direct link to the transcription in component files. You have one common link for all files, but it is better to use the real source of the data.
You can check the CZ data, where I am preserving a lot of link types:
- `bibl/idno`
- `pb/@source`
- `u/@source`
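To make the three link types concrete, here is a rough sketch of where they could sit in a component file. The positions follow the list above. Every URL and identifier is a placeholder (nothing here is taken from the PT or CZ data), and the `type` value on `idno` is an assumption that should be checked against the CZ corpus and the schema.

```xml
<!-- Fragments only, not a complete component file. All URLs and ids are placeholders. -->

<!-- teiHeader: source description pointing at this file's own transcript -->
<bibl>
  <idno type="URI">https://example.org/transcripts/session-4/number-34</idno>
</bibl>

<!-- text body: page break linked to the source page it was taken from -->
<pb source="https://example.org/transcripts/session-4/number-34#page=2"/>

<!-- text body: utterance linked to its position in the source -->
<u xml:id="example.u1" who="#ExampleSpeaker" ana="#regular"
   source="https://example.org/transcripts/session-4/number-34#speech-1">
  <seg xml:id="example.u1.seg1">...</seg>
</u>
```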
Nice work, I am adding a few more observations.
Discussed here: https://github.com/clarin-eric/ParlaMint/issues/305
https://github.com/cluljoseaires/ParlaMint/blob/369dc01e5b65b5a1719e487732b17b32512140ac/Data/ParlaMint-PT/ParlaMint-PT.xml#L185-L196
<org xml:id="Presidency" role="institution">
<orgName xml:lang="pt" full="yes">Presidência</orgName>
<orgName xml:lang="pt" full="yes">Presidência da República Portuguesa</orgName>
<listEvent>
<event xml:id="P.XIX" from="2006-03-09" to="2016-03-08">
<label xml:lang="pt">XIX Presidente</label>
</event>
<event xml:id="P.XX" from="2016-03-08" to="2026-03-09">
<label xml:lang="pt">XX Presidente</label>
</event>
</listEvent>
</org>
<org xml:id="Parliament" role="parliament" ana="#parla.national #parla.uni">
<orgName xml:lang="pt" full="yes">Assembleia da República</orgName>
<orgName xml:lang="pt" full="yes">Assembleia da República Portuguesa</orgName>
`surname` instead of `nameLink`
I think this type of name can be encoded with different elements, such as `nameLink`: https://github.com/cluljoseaires/ParlaMint/blob/369dc01e5b65b5a1719e487732b17b32512140ac/Data/ParlaMint-PT/ParlaMint-PT.xml#L583-L593
<persName>
<forename>Alexandre</forename>
<surname>Nuno</surname>
<surname>Vaz</surname>
<surname>Batista</surname>
<surname>de</surname>
<surname>Vieira</surname>
<surname>e</surname>
<surname>Brito</surname>
</persName>
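As a sketch of the `nameLink` suggestion, the particles in the name above could be tagged as linking elements rather than surnames. This is only one possible reading: whether `de` and `e` both count as name links, and whether `Nuno` belongs with the forenames, is for the corpus compilers to decide.

```xml
<persName>
  <forename>Alexandre</forename>
  <surname>Nuno</surname>
  <surname>Vaz</surname>
  <surname>Batista</surname>
  <!-- connecting particles tagged as nameLink instead of surname (assumed reading) -->
  <nameLink>de</nameLink>
  <surname>Vieira</surname>
  <nameLink>e</nameLink>
  <surname>Brito</surname>
</persName>
```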
`@xml:id`s
<taxonomy xml:id="parla.speakers">
<desc xml:lang="en">
<term>Types of speakers</term>
</desc>
<category xml:id="chair">
<catDesc>
<term/>
</catDesc>
</category>
<category xml:id="regular">
<catDesc>
<term/>
</catDesc>
</category>
<category xml:id="guest">
<catDesc>
<term/>
</catDesc>
</category>
</taxonomy>
Correct in CZ corpus
<taxonomy xml:id="speaker_types">
<desc xml:lang="cs">
<term>Druhy řečníků</term>
</desc>
<desc xml:lang="en">
<term>Types of speakers</term>
</desc>
<category xml:id="chair">
<catDesc xml:lang="cs">
<term>Předsedající</term>: předsedá zasedání</catDesc>
<catDesc xml:lang="en">
<term>Chairperson</term>: chairman of a sitting</catDesc>
</category>
<category xml:id="regular">
<catDesc xml:lang="cs">
<term>Poslanec</term>: poslanec nebo člen vlády</catDesc>
<catDesc xml:lang="en">
<term>Regular</term>: a regular speaker at a sitting</catDesc>
</category>
<category xml:id="guest">
<catDesc xml:lang="cs">
<term>Host</term>: ghostující řečník na sezení</catDesc>
<catDesc xml:lang="en">
<term>Guest</term>: a guest speaker at a sitting</catDesc>
</category>
</taxonomy>
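Following the CZ pattern, a Portuguese version might look roughly like the sketch below. It keeps the common `speaker_types` id and the category ids, and fills in `pt` and `en` terms. The Portuguese wording is an illustrative guess, not an agreed translation.

```xml
<taxonomy xml:id="speaker_types">
  <desc xml:lang="pt">
    <term>Tipos de oradores</term>
  </desc>
  <desc xml:lang="en">
    <term>Types of speakers</term>
  </desc>
  <category xml:id="chair">
    <!-- Portuguese glosses below are placeholders to be reviewed by the PT team -->
    <catDesc xml:lang="pt">
      <term>Presidente</term>: preside à sessão</catDesc>
    <catDesc xml:lang="en">
      <term>Chairperson</term>: chairman of a sitting</catDesc>
  </category>
  <category xml:id="regular">
    <catDesc xml:lang="pt">
      <term>Orador regular</term>: orador regular numa sessão</catDesc>
    <catDesc xml:lang="en">
      <term>Regular</term>: a regular speaker at a sitting</catDesc>
  </category>
  <category xml:id="guest">
    <catDesc xml:lang="pt">
      <term>Convidado</term>: orador convidado numa sessão</catDesc>
    <catDesc xml:lang="en">
      <term>Guest</term>: a guest speaker at a sitting</catDesc>
  </category>
</taxonomy>
```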
Thanks for the feedback.
We have another question about the Portuguese data. During tokenization, we decided to keep the frequent contractions in Portuguese and not to expand them. This is the same decision we took for the Reference Corpus of Contemporary Portuguese, and the reason is that expanding the contractions makes it difficult for users to read the text. To deal with this, we use composite POS tags. For instance, the contraction of a preposition+determiner (e.g., "no") is kept as such and is labelled with the tag "ADP+DET". We see that this is not accepted during validation. Could composite tags be integrated in ParlaMint?
<appInfo>
<application ident="id" version="0.1">
<label/>
<desc/>
</application>
</appInfo>
Neither 'republic' nor 'president' are allowed as 'presidency org role', which is why I chose 'institution', before your observation.
error: value of attribute "role" is invalid; must be equal to "boardOfDirectors", "boardOfParliament", "chamberOfTheNations", "chamberOfThePeople", "coalition", "commission", "committee", "conferenceOfChairs", "delegation", "ethnicCommunity", "europeanCommission", "europeanInstitution", "europeanParliament", "government", "institution", "internationalOrganisation", "interparliamentaryFriendshipGroup", "nationalCouncil", "ngo", "parliament", "parliamentaryGroup", "politicalParty", "senate", "subcommittee", "supervisoryBoard" or "workingGroup"
By the way, I did a pull just to make sure there were no missing updates on my side.
> By the way, I did a pull just to make sure there were no missing updates on my side.
My fault, it was in the documentation branch, which is now synced with the data branch: https://github.com/clarin-eric/ParlaMint/pull/356
About the political orientation:
```xml
<orgName xml:lang="pt" full="abb">PSD</orgName>
<event from="1974-04-25">
<label xml:lang="en">existence</label>
</event>
<state type="politicalOrientation" subtype="unknown" ana="#orientation.CR">
<note xml:lang="en">Orientation determined by encoder, using own knowledge of the parliamentary group.</note>
</state>
```
I get the following:
error: element "state" not allowed anywhere; expected the element end-tag or element "desc", "event", "idno" or "listEvent"
Perhaps I am missing something...
@TomazErjavec, can you please clarify this https://github.com/clarin-eric/ParlaMint/issues/332#issuecomment-1271491379 ?
I was trying to include the party political orientation using the first case of https://clarin-eric.github.io/ParlaMint/#sec-parties
<org xml:id="PSD" role="politicalParty">
<orgName xml:lang="pt" full="yes">Partido Social Democrata</orgName>
<orgName xml:lang="pt" full="abb">PSD</orgName>
<event from="1974-04-25"><label xml:lang="en">existence</label></event>
<state type="politicalOrientation" subtype="unknown" ana="#orientation.CR">
<note xml:lang="en">Orientation determined by encoder, using own knowledge of the parliamentary group.</note>
</state>
</org>
but I get the «element "state" not allowed anywhere» error mentioned in my previous post.
your branch is not up to date: https://github.com/cluljoseaires/ParlaMint/pull/1
Indeed, I needed to merge the main branch into the data branch... Sorry. I now need a taxonomy for the political orientations. Any tips would be welcome. ;)
@cluljoseaires, it is actually a bit premature for you to start adding political orientation. Cf. this comment, esp. the G'doc mentioned there. We are doing it via TSV files, which we first produce automatically and which then need to be edited, so we get the CHES orientations for free. So, let's wait until the corpora are submitted; then we can generate the TSVs, have them edited, and automatically add their content to the TEI.
`@xml:lang="en"` attribute
(`NER`, `speaker_types`, `subcorpus`, `parla.legislature`) and add to them a pt translation. #264
As for the `ud-syn` taxonomy, I am planning to create one common taxonomy (#321) for all possible languages in Universal Dependencies, so if you did not make changes in ids (= id corresponds to a term and the colon is replaced with an underscore), there shouldn't be a problem with embedding this complete taxonomy in your corpus - you don't need to care about it.
`langUsage` element - documented here: https://clarin-eric.github.io/ParlaMint/#sec-langUsage
Perhaps the taxonomies, in particular the ud-syn part, might need additional touches but, other than that, all issues have been addressed.
ok, closing issue
I am sorry, I was too hasty in closing this issue...
`meeting` -> `session`, `meeting`, `sitting`
<meeting ana="#parla.term #L.XII" n="s.4/n.34">Série I - XII Legislatura - Sessão 4 - Número 34</meeting>
@TomazErjavec's suggestion:
<meeting ana="#parla.term #L.XII #parla.uni" n="I/XII">Série I - XII Legislatura</meeting>
<meeting ana="#parla.session #parla.uni" n="4">Sessão 4</meeting>
<meeting ana="#parla.meeting #parla.uni" n="34">Número 34</meeting>
And please add `#parla.sitting` if it makes sense (I believe it does). You can use the date for `text()` and `@n`.
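Put together, the split plus the extra sitting entry might look like the sketch below. The first three elements repeat the suggestion above. The fourth is an assumption: it takes the sitting date to be 2015-01-07, read off the file name `ParlaMint-PT_darl12sl04n034-07-01-2015.xml` as day-month-year.

```xml
<meeting ana="#parla.term #L.XII #parla.uni" n="I/XII">Série I - XII Legislatura</meeting>
<meeting ana="#parla.session #parla.uni" n="4">Sessão 4</meeting>
<meeting ana="#parla.meeting #parla.uni" n="34">Número 34</meeting>
<!-- sitting entry: the date serves as both @n and the text content (date is assumed) -->
<meeting ana="#parla.sitting #parla.uni" n="2015-01-07">2015-01-07</meeting>
```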
`/TEI/@ana`
Please use the correct classification in `/TEI/@ana` that specifies the content of the file: `#parla.meeting` or `#parla.sitting`. Greek sample:
https://github.com/clarin-eric/ParlaMint/blob/a7c867f644b84269a30a818363fb4ec650ed3fff/Data/ParlaMint-GR/ParlaMint-GR_2015-02-06-S1-commons.xml#L1-L6
https://clarin-eric.github.io/ParlaMint/#sec-titleStmt
> The title statement starts with two titles (one main, the other subordinate), both in English and the local language, with the appropriate language code possibly inherited from a superordinate element. ... Both titles must be unique in the complete corpus.
Your titles are not unique: you have 111 distinct titles for 702 component files.
Thank you for the notes. I am just intrigued by the 702 files because there were supposed to be 704, but above all I am really confused by the 111 titles. Obviously, I need to do some checking...
As I suspected, the issue with the number of files was caused by some files having incorrectly repeated dates, which was masked by the remaining elements of the filename.
I have addressed all of your notes and I believe the documents are substantially better but tell me if you need improvements.
There is one issue left:
component file title
- [ ] unique title
@TomazErjavec, can you support my interpretation that both titles (`title[@type='main']` and `title[@type='sub']`) must be unique?
"Both titles" means both types (`main` and `sub`) in both/all languages (`lang=en` and `lang=xx`).
> In the example it can be seen that the main title of a corpus component is simply an extension of the corpus root title, as it also gives the name of the particular meeting that the component contains, while the subordinate title is, again, free text. Both titles must be unique in the complete corpus.
I believe that this example is misleading: (https://clarin-eric.github.io/ParlaMint/#sec-titleStmt)
<titleStmt>
<title type="main">Slovenski parlamentarni korpus ParlaMint-SI, izredna seja 59 [ParlaMint]</title>
<title type="main" xml:lang="en">Slovenian parliamentary corpus ParlaMint-SI, Extraordinary Session 59 [ParlaMint]</title>
<title type="sub">Zapisi sej Državnega zbora Republike Slovenije, 7. mandat, 59. izredna seja, 13.4.2018</title>
<title type="sub" xml:lang="en">Minutes of the National Assembly of the Republic of Slovenia, Term 7, Extraordinary Session 59, 13.4.2018</title>
because `Slovenski parlamentarni korpus ParlaMint-SI, izredna seja 59 [ParlaMint]` doesn't look like a unique title.
Given that @TomazErjavec has sent me a message saying everything was fine except for some notes having trailing spaces, I assume the example is acceptable.
@cluljoseaires, I just rebuilt your corpus and it turns out it is no longer valid. Namely (cf. #472) after you submitted your corpus we decided on a change to the schema (and Guidelines) so that div elements that do not contain utterances (but only headings, notes) should get the type "commentSection" rather than "debateSection", and your corpus has a number of such divs.
My finalization script corrects this, so you don't actually have to do anything if you can't be bothered and don't care if your copy of the corpus is different from the one that goes into the repository. But if you would like to keep them in synch, then please correct the types of these divs. The list of the files to be corrected can be found in https://nl.ijs.si/et/tmp/ParlaMint/Repo/ParlaMint-PT.log (grep for "no utterances in div").
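For orientation, the distinction comes down to whether a `div` contains any `u` elements. A minimal sketch, with invented headings, notes, and ids (none of this is taken from the PT corpus):

```xml
<!-- no utterances, only headings and notes: now typed commentSection -->
<div type="commentSection">
  <head>Example heading</head>
  <note>Example note about the proceedings.</note>
</div>

<!-- contains at least one utterance: stays debateSection -->
<div type="debateSection">
  <note type="speaker">Example speaker label</note>
  <u xml:id="example.u1" who="#ExampleSpeaker" ana="#regular">
    <seg xml:id="example.u1.seg1">Example speech text.</seg>
  </u>
</div>
```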
Note also that there are two reported errors as regards bad chars (again, this is a new check):
ERROR: File ParlaMint-PT_2019-12-20.ana.xml contains bad chars: U+F0B7 (4x)
ERROR: File ParlaMint-PT_2020-04-16.ana.xml contains bad chars: U+F020 (2x)
This I don't fix as it might be non-trivial, then again, we can live with 6 bad characters in the corpus.
And sorry about this late change & pls. let me know if you will (not) fix this, so I can re-close this issue.
This issue tries to summarize the changes needed to get better metadata in the PT corpus (commit 94489ada469c2ea1f548c911f65fcca85c680ec4).
Firstly I want to note that if the corpus is valid, then it should pass all validations (if there is not a bug in the validation script). But the opposite implication is not true! So passing validation doesn't necessarily mean that the corpus is valid.
ParlaMint structure and metadata
person name
`persName/term`
You use the `term` element in `persName`:
The `term` is meant to be used if every other option fails: https://clarin-eric.github.io/ParlaMint/#TEI.persName
So the correct encoding of a person's name is to use `forename` and `surname` in the `listPerson/person` context.
affiliation roles
You use only the `member` role in affiliations; there is no `minister` or `head` role. I guess you have at least the `minister` role information because there are affiliations with the government. Remember that if someone is affiliated with the `minister` role, they should also be affiliated with the `member` role for the same period: https://clarin-eric.github.io/ParlaMint/#sec-affiliation
`affiliation/@role`
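A sketch of the paired affiliations described above, with made-up ids and dates. The point is only that the `minister` affiliation is accompanied by a `member` affiliation with the same `ref` and the same period. The `#GOV.example` target, the person id, and the dates are all hypothetical.

```xml
<person xml:id="ExamplePerson">
  <persName>
    <forename>Example</forename>
    <surname>Person</surname>
  </persName>
  <!-- membership in the government for the period in office -->
  <affiliation ref="#GOV.example" role="member" from="2016-01-01" to="2018-12-31"/>
  <!-- the more specific role, covering the same period -->
  <affiliation ref="#GOV.example" role="minister" from="2016-01-01" to="2018-12-31"/>
</person>
```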
taxonomies
`taxonomy`
There are taxonomies that I don't understand, e.g.: https://github.com/cluljoseaires/ParlaMint/blob/94489ada469c2ea1f548c911f65fcca85c680ec4/Data/ParlaMint-PT/ParlaMint-PT.ana.xml#L127-L145
You are supposed to take the common taxonomies and translate them into the corpus language. You can start with the taxonomies from the SI corpus: https://github.com/clarin-eric/ParlaMint/blob/5e4fdd6d638bd22bbf121ffcb605cd76ebf952b4/Data/ParlaMint-SI/ParlaMint-SI.ana.xml#L117-L125
We will then merge (on the `@xml:id` attribute) these translations into one common taxonomy (https://github.com/clarin-eric/ParlaMint/issues/264)
meeting element
`meeting`
https://github.com/cluljoseaires/ParlaMint/blob/94489ada469c2ea1f548c911f65fcca85c680ec4/Data/ParlaMint-PT/ParlaMint-PT.ana.xml#L8
`meeting` elements do not provide enough information. See examples: https://clarin-eric.github.io/ParlaMint/#TEI.meeting Also component files need fixing.
`setting` (and also `bibl`) in component files
`setting`
`setting` in a component file should describe the content of the component file, not the whole corpus: https://github.com/cluljoseaires/ParlaMint/blob/94489ada469c2ea1f548c911f65fcca85c680ec4/Data/ParlaMint-PT/ParlaMint-PT_darl12sl04n034-07-01-2015.xml#L68
for this file `ParlaMint-PT_darl12sl04n034-07-01-2015.xml` it should be
strangeness in data (my observations)
corpus timespan (end)
`settingDesc`: https://github.com/cluljoseaires/ParlaMint/blob/94489ada469c2ea1f548c911f65fcca85c680ec4/Data/ParlaMint-PT/ParlaMint-PT.ana.xml#L386
last parliament event: https://github.com/cluljoseaires/ParlaMint/blob/94489ada469c2ea1f548c911f65fcca85c680ec4/Data/ParlaMint-PT/ParlaMint-PT.ana.xml#L408-L410
last government: https://github.com/cluljoseaires/ParlaMint/blob/94489ada469c2ea1f548c911f65fcca85c680ec4/Data/ParlaMint-PT/ParlaMint-PT.ana.xml#L426-L428
`sourceDesc/bibl`: https://github.com/cluljoseaires/ParlaMint/blob/94489ada469c2ea1f548c911f65fcca85c680ec4/Data/ParlaMint-PT/ParlaMint-PT.ana.xml#L49
last file in your full data
`title`: https://github.com/cluljoseaires/ParlaMint/blob/94489ada469c2ea1f548c911f65fcca85c680ec4/Data/ParlaMint-PT/ParlaMint-PT.ana.xml#L7
I believe that at least `settingDesc` should correspond to the corpus timespan: (https://tei-c.org/release/doc/tei-p5-doc/en/html/ref-settingDesc.html)
affiliation of dead person
https://github.com/cluljoseaires/ParlaMint/blob/94489ada469c2ea1f548c911f65fcca85c680ec4/Data/ParlaMint-PT/ParlaMint-PT.ana.xml#L1109-L1120