Closed matyaskopp closed 1 year ago
About these specific issues in this list, can you please clarify what is wrong in:
- datespan in title
- corpus should contain multiple terms ?
About the other issues, some are easy to fix: I can update the GitHub samples starting from next Monday. Others require more work.
The corpus covers a longer period than 2017-2017. The 2017-2017 span probably appears because your sample contains only a short period.
It is better to use one file from each term in the sample: the sample will be more informative, and older transcripts may contain phenomena that don't appear in the newest ones.
I should add that even for a sample, the corpus root should be the same as for the full corpus, i.e. it should give the dates that the corpus covers, the complete list of persons and organisations, etc.
I have updated the samples in GitHub. However, I run into problems with validation on my side...
Does 'make validate-parlamint-FR' work on the 'ana' versions?
'make val-schema-ana-FR' gives weird errors...
../ParlaMint/Data/TMP/ParlaMint-FR/ParlaMint-FR.ana.xml:2:1412: error: ID "ParlaMint-FR_2017-06-28-O1125.d1_1" has already been defined
Your ids are not unique because you are including the TEI version in the TEI.ana version: https://github.com/gclux/ParlaMint/blob/712a9fd76d66fe0f896b67c871fc303124d5eb90/Data/ParlaMint-FR/ParlaMint-FR.ana.xml#L16739-L16750
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="2017/ParlaMint-FR_2017-06-28-O1125.xml"/>
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="2018/ParlaMint-FR_2018-01-16-O1111.xml"/>
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="2019/ParlaMint-FR_2019-09-10-E2001.xml"/>
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="2020/ParlaMint-FR_2020-01-07-O1114.xml"/>
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="2021/ParlaMint-FR_2021-01-12-O1125.xml"/>
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="2022/ParlaMint-FR_2022-03-23-O1168.xml"/>
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="2017/ParlaMint-FR_2017-06-28-O1125.ana.xml"/>
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="2018/ParlaMint-FR_2018-01-16-O1111.ana.xml"/>
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="2019/ParlaMint-FR_2019-09-10-E2001.ana.xml"/>
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="2020/ParlaMint-FR_2020-01-07-O1114.ana.xml"/>
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="2021/ParlaMint-FR_2021-01-12-O1125.ana.xml"/>
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="2022/ParlaMint-FR_2022-03-23-O1168.ana.xml"/>
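A quick way to spot this kind of mix-up before running the full validation is to list the includes in the .ana root that do not point to .ana.xml files. This is only a minimal sketch in Python, not part of the ParlaMint scripts: the function name `non_ana_includes` is made up for illustration, and a real root file may legitimately include other non-.ana files (taxonomies, for instance), which would have to be whitelisted.

```python
import xml.etree.ElementTree as ET

# Clark-notation tag name for xi:include
XI_INCLUDE = "{http://www.w3.org/2001/XInclude}include"


def non_ana_includes(root_xml: str) -> list:
    """Return the hrefs of xi:include elements that do not end in .ana.xml.

    Hypothetical helper: in an .ana root file, such includes pull in the
    plain TEI components, whose ids then clash with the .ana components.
    """
    root = ET.fromstring(root_xml)
    return [
        inc.get("href")
        for inc in root.iter(XI_INCLUDE)
        if not inc.get("href", "").endswith(".ana.xml")
    ]


# Reduced sample based on the includes quoted above (TEI namespace omitted)
sample = """<teiCorpus xmlns:xi="http://www.w3.org/2001/XInclude">
  <xi:include href="2017/ParlaMint-FR_2017-06-28-O1125.xml"/>
  <xi:include href="2017/ParlaMint-FR_2017-06-28-O1125.ana.xml"/>
</teiCorpus>"""

print(non_ana_includes(sample))
```

For the sample, the only href reported is the plain 2017/ParlaMint-FR_2017-06-28-O1125.xml include.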
Yes. Just updated. But, there is a problem in my validation script...
Scripts/validate-parlamint.pl Schema 'Data/ParlaMint-FR'
INFO: Validating directory /home/glux/Work/GitHub/ParlaMint/Data/ParlaMint-FR
INFO: Validating TEI root /home/glux/Work/GitHub/ParlaMint/Data/ParlaMint-FR/ParlaMint-FR.xml
INFO: Char validation for ParlaMint-FR.xml
Died at Scripts/validate-parlamint.pl line 55, <IN> chunk 1.
make: *** [Makefile:204: validate-parlamint-FR] Error 255
But, there is a problem in my validation script Died at Scripts/validate-parlamint.pl line 55
Oh dear, this is my fault, sorry. In the devel branch there was an explicit "die" in this script, which was there for some testing purposes. I removed it now.
@gclux I have updated https://github.com/clarin-eric/ParlaMint/issues/574#issue-1515142559 to reflect this status: https://github.com/gclux/ParlaMint/commit/2885b85ffe1403ffa74f005b43b218ef431069bd
About...
terms should be in parliament organization
and..
missing terms in component files
...I apparently misunderstood the
OK, I can remove the terms from the taxonomy and use the org/@xml:id. This will give in the root file:
<meeting n="168" corresp="#ParlaMint-FR-LOWER" ana="#parla.national #parla.lower #parla.term #PO717460">15e législature</meeting>
<meeting n="2" corresp="#ParlaMint-FR-LOWER" ana="#parla.national #parla.lower #parla.term #PO791932">16e législature</meeting>
...and in the last component file of the 15th term:
<meeting n="O1"
corresp="#ParlaMint-FR-LOWER"
ana="#parla.session #ParlaMint-FR-LOWER"
xml:lang="fr">Session ordinaire 2021-2022 (CRSANR5L15S2022O1N168)</meeting>
<meeting n="168"
corresp="#ParlaMint-FR-LOWER"
ana="#parla.sitting #ParlaMint-FR-LOWER #PO717460"
xml:lang="fr">168. séance</meeting>
Do we agree?
OK, I can remove the terms from the taxonomy and use the org/@xml:id. This will give in the root file:
Use org/eventList/event/@xml:id
<meeting n="168" corresp="#ParlaMint-FR-LOWER" ana="#parla.national #parla.lower #parla.term #PO717460">15e législature</meeting>
<meeting n="2" corresp="#ParlaMint-FR-LOWER" ana="#parla.national #parla.lower #parla.term #PO791932">16e législature</meeting>
Root file should contain (I hope the ids refer to the correct events...):
<meeting n="15" corresp="#ParlaMint-FR-LOWER" ana="#parla.national #parla.lower #parla.term #PO717460" xml:lang="fr">15e législature</meeting>
<meeting n="16" corresp="#ParlaMint-FR-LOWER" ana="#parla.national #parla.lower #parla.term #PO791932" xml:lang="fr">16e législature</meeting>
...and in the last component file of the 15th term:
<meeting n="O1" corresp="#ParlaMint-FR-LOWER" ana="#parla.session #ParlaMint-FR-LOWER" xml:lang="fr">Session ordinaire 2021-2022 (CRSANR5L15S2022O1N168)</meeting>
<meeting n="168" corresp="#ParlaMint-FR-LOWER" ana="#parla.sitting #ParlaMint-FR-LOWER #PO717460" xml:lang="fr">168. séance</meeting>
<meeting n="15"
corresp="#ParlaMint-FR-LOWER"
ana="#parla.national #parla.lower #parla.term #PO717460"
xml:lang="fr">15e législature</meeting>
<meeting n="O1"
corresp="#ParlaMint-FR-LOWER"
ana="#parla.session #ParlaMint-FR-LOWER"
xml:lang="fr">Session ordinaire 2021-2022 (CRSANR5L15S2022O1N168)</meeting>
<meeting n="168"
corresp="#ParlaMint-FR-LOWER"
ana="#parla.sitting #ParlaMint-FR-LOWER"
xml:lang="fr">168. séance</meeting>
OK. I will now use more readable ids for the term...
<meeting n="15"
corresp="#ParlaMint-FR-LOWER"
ana="#parla.national #parla.lower #parla.term #parla.term.16"
xml:lang="fr">15e législature</meeting>
<meeting n="O1"
corresp="#ParlaMint-FR-LOWER"
ana="#parla.session #ParlaMint-FR-LOWER"
xml:lang="fr">Session ordinaire 2021-2022 (CRSANR5L15S2022O1N168)</meeting>
<meeting n="168"
corresp="#ParlaMint-FR-LOWER"
ana="#parla.sitting #ParlaMint-FR-LOWER"
xml:lang="fr">168. séance</meeting>
I think that the documentation may be improved here... https://clarin-eric.github.io/ParlaMint/#sec-titleStmt
As opposed to the given example, in France we have three levels of meeting description: term - session - sitting.
I will try to use 'term', which is the correct translation of the French 'législature'.
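To check that a component file carries all three levels, one could look at the ana values of its meeting elements. The following is a rough Python sketch under assumptions, not an existing ParlaMint check: the helper names `meeting_levels` and `missing_levels` are invented here, and the level names are simply the parla.term, parla.session and parla.sitting references used in the examples above.

```python
import xml.etree.ElementTree as ET

# Levels as they appear in the ana attributes quoted in this thread
LEVELS = ("parla.term", "parla.session", "parla.sitting")


def meeting_levels(title_stmt_xml: str) -> set:
    """Collect the meeting levels referenced by <meeting ana="..."> elements."""
    root = ET.fromstring(title_stmt_xml)
    found = set()
    for el in root.iter():
        # Compare local names so this works with or without the TEI namespace
        if el.tag.rsplit("}", 1)[-1] != "meeting":
            continue
        refs = {ref.lstrip("#") for ref in el.get("ana", "").split()}
        found.update(refs.intersection(LEVELS))
    return found


def missing_levels(title_stmt_xml: str) -> list:
    """Return the levels that no <meeting> element refers to."""
    present = meeting_levels(title_stmt_xml)
    return [level for level in LEVELS if level not in present]


# Component titleStmt with session and sitting, but without the term
sample = """<titleStmt>
  <meeting n="O1" corresp="#ParlaMint-FR-LOWER" ana="#parla.session #ParlaMint-FR-LOWER" xml:lang="fr">Session ordinaire 2021-2022 (CRSANR5L15S2022O1N168)</meeting>
  <meeting n="168" corresp="#ParlaMint-FR-LOWER" ana="#parla.sitting #ParlaMint-FR-LOWER" xml:lang="fr">168. séance</meeting>
</titleStmt>"""

print(missing_levels(sample))
```

For this sample the term level is reported as missing, which is the situation described under "missing terms in component files".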
@gclux thanks for updating the sample. It is great that you were able to fix the guest speakers.
I have updated the ticks; two items remain unticked:
About: merge repeated organizations
It is not an error. This is the case of a ministry shared by two ministers: https://en.wikipedia.org/wiki/Jacqueline_Gourault
...she previously served as Minister attached to the Minister of the Interior from 2017 to 2018.
I think the best would be to manually "patch" the second organization...
<org xml:id="PO729937" role="ministry">
<orgName full="yes" xml:lang="fr">Ministère de l’intérieur</orgName>
<orgName full="abb">INT</orgName>
<event from="2017-06-22" to="2018-10-16">
<label xml:lang="en">existence</label>
</event>
</org>
<org xml:id="PO730004" role="ministry">
<orgName full="yes" xml:lang="fr">Ministère auprès du ministre d'État, ministre de l'intérieur</orgName>
<orgName full="abb">INT</orgName>
<event from="2017-06-22" to="2018-10-16">
<label xml:lang="en">existence</label>
</event>
</org>
Have you found any other cases of such duplications?
No, I overlooked it: there is only one such organization.
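For reference, candidates of this kind can be listed by grouping organizations on their abbreviation. This is just an illustrative Python sketch: `orgs_sharing_abbreviation` is a hypothetical helper, the sample omits the TEI namespace for brevity, and a shared abbreviation is only a hint, since (as in the ministry case above) it does not have to be an error.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# xml:id as ElementTree exposes it
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def orgs_sharing_abbreviation(list_org_xml: str) -> dict:
    """Map each abbreviation used by more than one <org> to the org ids."""
    root = ET.fromstring(list_org_xml)
    ids_by_abbr = defaultdict(list)
    for org in root.iter("org"):
        for name in org.findall("orgName"):
            if name.get("full") == "abb" and name.text:
                ids_by_abbr[name.text.strip()].append(org.get(XML_ID))
    return {abbr: ids for abbr, ids in ids_by_abbr.items() if len(ids) > 1}


# The two organizations quoted above, wrapped in a listOrg
sample = """<listOrg>
  <org xml:id="PO729937" role="ministry">
    <orgName full="yes" xml:lang="fr">Ministère de l’intérieur</orgName>
    <orgName full="abb">INT</orgName>
  </org>
  <org xml:id="PO730004" role="ministry">
    <orgName full="yes" xml:lang="fr">Ministère auprès du ministre d'État, ministre de l'intérieur</orgName>
    <orgName full="abb">INT</orgName>
  </org>
</listOrg>"""

print(orgs_sharing_abbreviation(sample))
```

Both ids are reported under the abbreviation INT; whether they should be merged or patched still has to be decided by hand.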
About: unique main title
I did not know the title had to be unique! ... I am surprised this error shows up now!!!
I can copy the ", séance : 2, 25/09/2017" from the subtitle (I believe there may be several sittings on the same day). This mention would then be redundant.
Does the subtitle also have to be unique?
Alternatively, I can take the unique id from the source file.
But this would imply a new recompilation!!!
I did not know the title had to be unique! ... I am surprised this error shows up now!!!
Mention about unique title is in documentation (https://clarin-eric.github.io/ParlaMint/#exa-titleStmtComp):
In the example it can be seen that the main title of a corpus component is simply an extension of the corpus root title, as it also gives the name of the particular meeting that the component contains, while the subordinate title is, again, free text. Both titles must be unique in the complete corpus.
but it is not in the validation script. Copying values from subtitles seems ok to me.
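Since the validation script does not cover this, a uniqueness check could look roughly like the following. It is a sketch only, written in Python rather than as part of validate-parlamint.pl: the function `duplicated_titles` is hypothetical, the titles are assumed to have been extracted from the component files already, and the title strings and the second and third file names in the sample are invented placeholders.

```python
from collections import defaultdict


def duplicated_titles(main_titles: dict) -> dict:
    """Map each main title used by more than one file to those file names.

    `main_titles` maps a component file name to its main title; whitespace
    is normalized so that line breaks inside a title do not hide duplicates.
    """
    files_by_title = defaultdict(list)
    for filename, title in main_titles.items():
        files_by_title[" ".join(title.split())].append(filename)
    return {t: files for t, files in files_by_title.items() if len(files) > 1}


# Placeholder data: the titles and the last two file names are invented
sample = {
    "ParlaMint-FR_2017-09-25-E2002.xml": "Example corpus title, 25/09/2017",
    "hypothetical-second-sitting.xml": "Example corpus title,\n  25/09/2017",
    "hypothetical-other-day.xml": "Example corpus title, 26/09/2017",
}

print(duplicated_titles(sample))
```

With two sittings on the same day sharing a date-only title, the first two files are reported together, which is why appending the sitting number from the subtitle would resolve the clash.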
@TomazErjavec do we insist on this? It is your requirement, and I am not sure where it came from (inherited from TEI recommendations?)
The unique title and the duplicated organization seem to be fixed in the data delivered to @TomazErjavec, so I am closing this issue and merging the sample (it will be fixed/overwritten by the ParlaMint v3.0 sample).
date in title
https://github.com/gclux/ParlaMint/blob/0d76f9ca9c02e85a2e8744ff69b709a46c7a90d2/Data/ParlaMint-FR/ParlaMint-FR.xml#L8-L9
corpus terms
setting in root should contain whole corpus period
<setting>
https://github.com/gclux/ParlaMint/blob/0d76f9ca9c02e85a2e8744ff69b709a46c7a90d2/Data/ParlaMint-FR/ParlaMint-FR.xml#L390-L396
wrong dates in subcorpus taxonomy
https://github.com/gclux/ParlaMint/blob/0d76f9ca9c02e85a2e8744ff69b709a46c7a90d2/Data/ParlaMint-FR/ParlaMint-FR.xml#L367-L386
see: https://github.com/clarin-eric/ParlaMint/blob/5deaeed5ae792f3ba1726072298885b5b64a6d64/Data/ParlaMint-AT/ParlaMint-taxonomy-subcorpus.xml#L2-L16
wrong government events from dates
https://github.com/gclux/ParlaMint/blob/0d76f9ca9c02e85a2e8744ff69b709a46c7a90d2/Data/ParlaMint-FR/ParlaMint-FR.xml#L408-L421 Every event starts on "1959-01-09"
merge repeated organizations
multiple organizations are repeated, e.g.:
2 times: Ministère de l'intérieur
2 times: Ministère de l’intérieur
non-attached member affiliation role
xml-model in component file preamble
<?xml-model ...
https://github.com/gclux/ParlaMint/blob/0d76f9ca9c02e85a2e8744ff69b709a46c7a90d2/Data/ParlaMint-FR/2017-18/ParlaMint-FR_2017-09-25-E2002.xml#L2
unique main title
I am not sure if this main title is unique across the whole corpus; you can append the date to make it unique: https://github.com/gclux/ParlaMint/blob/0d76f9ca9c02e85a2e8744ff69b709a46c7a90d2/Data/ParlaMint-FR/2017-18/ParlaMint-FR_2017-09-25-E2002.xml#L12-L13
use chair when chair is speaking
#chair instead of #speaker ?
#regular instead of #government
You have added new speaker roles. For instance, I have no idea what speaker means - sometimes it looks like a regular, sometimes a chair. Now we support these "roles": chair, regular, guest. In v3.1 we plan to unify common taxonomies, which can raise problems.
https://github.com/gclux/ParlaMint/blob/0d76f9ca9c02e85a2e8744ff69b709a46c7a90d2/Data/ParlaMint-FR/2017-18/ParlaMint-FR_2017-09-25-E2002.xml#L103
missing join right in articles
eg:
La parole est à M. le ministre d’État, ministre de l’intérieur.
missing terms in component files
https://github.com/gclux/ParlaMint/blob/2885b85ffe1403ffa74f005b43b218ef431069bd/Data/ParlaMint-FR/2022/ParlaMint-FR_2022-03-23-O1168.xml#L14-L21
should be extended with term information - add this line (copied from root file):
Volodymyr Zelenskyy should be a guest - definitely not unknown
https://github.com/gclux/ParlaMint/blob/2885b85ffe1403ffa74f005b43b218ef431069bd/Data/ParlaMint-FR/2022/ParlaMint-FR_2022-03-23-O1168.xml#L135
And the rest of the speech is wrongly attributed to PA720124 (Aude Amadou):
https://github.com/gclux/ParlaMint/blob/2885b85ffe1403ffa74f005b43b218ef431069bd/Data/ParlaMint-FR/2022/ParlaMint-FR_2022-03-23-O1168.xml#L146-L160
terms should be in parliament organization
Term should be an event in parliament, not a category in taxonomy. This should be removed: https://github.com/gclux/ParlaMint/blob/2885b85ffe1403ffa74f005b43b218ef431069bd/Data/ParlaMint-FR/ParlaMint-FR.xml#L216-L225
And correct terms should be added to the meeting element.