Can I see an example, because I am not sure I correctly understand what is wrong?
Tomaž Erjavec wrote on 21.04.2023 at 21:16:
I have re-processed all the submitted corpora, and PL showed errors for missing taxonomies. I had a look, and it turns out parlamint-factorize-teiHeader.xsl did not factorise the teiHeader appropriately. I didn't manage to find the problem in the script itself, but I think I did identify the problem in the corpus: all the taxonomies were in place (so, in the TEI header), but listPerson and listOrg were factorised. I've now included these two lists in the teiHeader, so that it is completely "unfactorised", and now everything seems to work fine. So it seems parlamint-factorize-teiHeader.xsl does not work correctly if the teiHeader is partially factorised. I'm not sure whether this is a proper bug, because the situation is anomalous: nobody but PL had it. So, @mrudolf https://github.com/mrudolf, be aware for any future submissions that the teiHeader should be either completely factorised or not factorised at all (unless @matyaskopp https://github.com/matyaskopp fixes his script, in which case it won't matter).
— Michał Rudolf
The partners submitted either a root file (ParlaMint-XX.xml / ParlaMint-AT.ana.xml) that has all the taxonomies, listPerson and listOrg included in its teiHeader, or a root file that XIncludes the taxonomies, listPerson and listOrg, e.g.
-rw-r--r-- 1 tomaz tomaz 40920 Dec 14 15:59 ParlaMint-AT-listOrg.xml
-rw-r--r-- 1 tomaz tomaz 881591 Feb 10 16:20 ParlaMint-AT-listPerson.xml
-rw-r--r-- 1 tomaz tomaz 173905 Dec 16 12:14 ParlaMint-AT.xml
-rw-r--r-- 1 tomaz tomaz 4152 Dec 14 15:59 ParlaMint-taxonomy-parla.legislature.xml
-rw-r--r-- 1 tomaz tomaz 1129 Dec 14 15:59 ParlaMint-taxonomy-speaker_types.xml
-rw-r--r-- 1 tomaz tomaz 775 Dec 14 15:59 ParlaMint-taxonomy-subcorpus.xml
You, however, have the taxonomies included in the teiHeader, but listPerson and listOrg XIncluded:
-rw-r--r-- 1 tomaz tomaz 3733 Mar 28 11:23 ParlaMint-PL-listOrg.xml
-rw-r--r-- 1 tomaz tomaz 500454 Mar 28 11:23 ParlaMint-PL-listPerson.xml
-rw-rw-r-- 1 tomaz tomaz 66424 Mar 28 18:25 ParlaMint-PL.xml
For the corpora that will be distributed, I do the factorisation, so that your corpus winds up like:
-rw-rw-r-- 1 tomaz tomaz 3860 Apr 22 01:53 ParlaMint-PL-listOrg.xml
-rw-rw-r-- 1 tomaz tomaz 524821 Apr 22 01:53 ParlaMint-PL-listPerson.xml
-rw-rw-r-- 1 tomaz tomaz 55067 Apr 22 01:53 ParlaMint-PL.xml
-rw-rw-r-- 1 tomaz tomaz 8091 Apr 22 01:53 ParlaMint-taxonomy-parla.legislature.xml
-rw-rw-r-- 1 tomaz tomaz 966 Apr 22 01:53 ParlaMint-taxonomy-speaker_types.xml
-rw-rw-r-- 1 tomaz tomaz 818 Apr 22 01:53 ParlaMint-taxonomy-subcorpus.xml
But the factorisation script does not seem to like that you have a "semi-factorised" root file. So what I have done now is put listPerson and listOrg back into the teiHeader of the root file, so it is simply:
-rw-rw-r-- 1 tomaz tomaz 587393 Apr 21 20:27 ParlaMint-PL.ana.xml
and then the factorisation works fine, with the result that you can find at https://nl.ijs.si/et/tmp/ParlaMint/Repo/ParlaMint-PL.tgz and https://nl.ijs.si/et/tmp/ParlaMint/Repo/ParlaMint-PT.ana.tgz
So, your files are ok at my end, but maybe you should do the same for any further submissions, i.e. just put listPerson and listOrg inside the teiHeader.
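Before submitting, a check along these lines could flag a "semi-factorised" root file. This is only a minimal sketch and not part of the ParlaMint scripts: the function name `header_state` is hypothetical, and it assumes that every `xi:include` inside the teiHeader stands for a factorised taxonomy, listPerson or listOrg.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XI_NS = "http://www.w3.org/2001/XInclude"

# Header components that are either kept in place or XIncluded.
COMPONENT_TAGS = {f"{{{TEI_NS}}}{name}" for name in ("taxonomy", "listPerson", "listOrg")}
INCLUDE_TAG = f"{{{XI_NS}}}include"


def header_state(root_xml: str) -> str:
    """Classify the teiHeader of a root file (hypothetical helper).

    Returns 'factorised', 'unfactorised', 'partial' or 'empty'.
    Assumption: every xi:include in the header replaces one component.
    """
    root = ET.fromstring(root_xml)
    header = root.find(f"{{{TEI_NS}}}teiHeader")
    if header is None:
        raise ValueError("no teiHeader found in the root element")
    in_place = sum(1 for el in header.iter() if el.tag in COMPONENT_TAGS)
    included = sum(1 for el in header.iter() if el.tag == INCLUDE_TAG)
    if in_place and included:
        return "partial"
    if included:
        return "factorised"
    if in_place:
        return "unfactorised"
    return "empty"


# A toy header shaped like the PL submission: taxonomy in place, lists XIncluded.
PL_LIKE = """
<teiCorpus xmlns="http://www.tei-c.org/ns/1.0"
           xmlns:xi="http://www.w3.org/2001/XInclude">
  <teiHeader>
    <encodingDesc><classDecl><taxonomy xml:id="subcorpus"/></classDecl></encodingDesc>
    <profileDesc><particDesc>
      <xi:include href="ParlaMint-PL-listOrg.xml"/>
      <xi:include href="ParlaMint-PL-listPerson.xml"/>
    </particDesc></profileDesc>
  </teiHeader>
</teiCorpus>
"""

print(header_state(PL_LIKE))  # prints "partial"
```

A result of `partial` corresponds to the PL case described above; `factorised` and `unfactorised` are the two layouts that the other partners submitted.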
@TomazErjavec, I have discovered a different issue with the way you use the script.
I am calling it this way (in the samples, both TEI and TEI.ana are in the same folder): https://github.com/clarin-eric/ParlaMint/blob/91ba6eb0c7e82240638c5b11e25c47237b8be619/Makefile#L307-L319 First, I factorize the TEI version, and then I factorize TEI.ana and skip the files already seen in the TEI version.
Your script does not account for the IS situation (custom taxonomies in both the TEI and TEI.ana versions):
$ ls -l ParlaMint-IS.TEI*/ParlaMint-*
-rw-rw-r-- 1 tomaz tomaz 5683 dub 26 17:33 ParlaMint-IS.TEI.ana/ParlaMint-IS-listOrg.xml
-rw-rw-r-- 1 tomaz tomaz 198355 dub 26 17:33 ParlaMint-IS.TEI.ana/ParlaMint-IS-listPerson.xml
-rw-rw-r-- 1 tomaz tomaz 6800 dub 26 17:33 ParlaMint-IS.TEI.ana/ParlaMint-IS-taxonomy-gov.ministries.ana.xml
-rw-rw-r-- 1 tomaz tomaz 9120 dub 26 17:33 ParlaMint-IS.TEI.ana/ParlaMint-IS-taxonomy-parla.categories.ana.xml
-rw-rw-r-- 1 tomaz tomaz 5424 dub 26 17:33 ParlaMint-IS.TEI.ana/ParlaMint-IS-taxonomy-parla.constituencies.ana.xml
-rw-rw-r-- 1 tomaz tomaz 490205 dub 26 17:33 ParlaMint-IS.TEI.ana/ParlaMint-IS-taxonomy-parla.topics.ana.xml
-rw-rw-r-- 1 tomaz tomaz 109368 dub 26 17:33 ParlaMint-IS.TEI.ana/ParlaMint-IS.ana.xml
-rw-rw-r-- 1 tomaz tomaz 865 dub 26 17:33 ParlaMint-IS.TEI.ana/ParlaMint-taxonomy-NER.ana.xml
-rw-rw-r-- 1 tomaz tomaz 4672 dub 26 17:33 ParlaMint-IS.TEI.ana/ParlaMint-taxonomy-UD-SYN.ana.xml
-rw-rw-r-- 1 tomaz tomaz 5426 dub 26 17:33 ParlaMint-IS.TEI.ana/ParlaMint-taxonomy-parla.legislature.xml
-rw-rw-r-- 1 tomaz tomaz 941 dub 26 17:33 ParlaMint-IS.TEI.ana/ParlaMint-taxonomy-speaker_types.xml
-rw-rw-r-- 1 tomaz tomaz 818 dub 26 17:33 ParlaMint-IS.TEI.ana/ParlaMint-taxonomy-subcorpus.xml
-rw-rw-r-- 1 tomaz tomaz 5683 dub 26 18:15 ParlaMint-IS.TEI/ParlaMint-IS-listOrg.xml
-rw-rw-r-- 1 tomaz tomaz 198355 dub 26 18:15 ParlaMint-IS.TEI/ParlaMint-IS-listPerson.xml
-rw-rw-r-- 1 tomaz tomaz 6796 dub 26 18:15 ParlaMint-IS.TEI/ParlaMint-IS-taxonomy-gov.ministries.xml
-rw-rw-r-- 1 tomaz tomaz 9116 dub 26 18:15 ParlaMint-IS.TEI/ParlaMint-IS-taxonomy-parla.categories.xml
-rw-rw-r-- 1 tomaz tomaz 5420 dub 26 18:15 ParlaMint-IS.TEI/ParlaMint-IS-taxonomy-parla.constituencies.xml
-rw-rw-r-- 1 tomaz tomaz 490201 dub 26 18:15 ParlaMint-IS.TEI/ParlaMint-IS-taxonomy-parla.topics.xml
-rw-rw-r-- 1 tomaz tomaz 102776 dub 26 18:15 ParlaMint-IS.TEI/ParlaMint-IS.xml
-rw-rw-r-- 1 tomaz tomaz 5426 dub 26 18:15 ParlaMint-IS.TEI/ParlaMint-taxonomy-parla.legislature.xml
-rw-rw-r-- 1 tomaz tomaz 941 dub 26 18:15 ParlaMint-IS.TEI/ParlaMint-taxonomy-speaker_types.xml
-rw-rw-r-- 1 tomaz tomaz 818 dub 26 18:15 ParlaMint-IS.TEI/ParlaMint-taxonomy-subcorpus.xml
So you get two similar files with different names, e.g. (the sizes differ because of the different /taxonomy/@xml:id):
-rw-rw-r-- 1 tomaz tomaz 9120 dub 26 17:33 ParlaMint-IS.TEI.ana/ParlaMint-IS-taxonomy-parla.categories.ana.xml
-rw-rw-r-- 1 tomaz tomaz 9116 dub 26 18:15 ParlaMint-IS.TEI/ParlaMint-IS-taxonomy-parla.categories.xml
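Such pairs can be found mechanically by comparing the file names in the two directories. The following is a rough sketch rather than anything taken from the ParlaMint scripts: `ana_twins` is a hypothetical helper, and it assumes that component files can be recognised by `-taxonomy-` or `-list` in their names, so that the root files themselves are not reported.

```python
def is_component(name: str) -> bool:
    """Assumption: taxonomies and listPerson/listOrg files follow these name patterns."""
    return "-taxonomy-" in name or "-list" in name


def ana_twins(tei_files, ana_files):
    """Return (tei_name, ana_name) pairs where the TEI.ana component file
    is the TEI component file with '.ana' inserted before '.xml'."""
    tei = set(tei_files)
    pairs = []
    for name in sorted(ana_files):
        if is_component(name) and name.endswith(".ana.xml"):
            plain = name[: -len(".ana.xml")] + ".xml"
            if plain in tei:
                pairs.append((plain, name))
    return pairs


# File names copied from the IS listing above (shortened selection).
tei_dir = [
    "ParlaMint-IS-listOrg.xml",
    "ParlaMint-IS-taxonomy-gov.ministries.xml",
    "ParlaMint-IS-taxonomy-parla.categories.xml",
    "ParlaMint-IS.xml",
    "ParlaMint-taxonomy-subcorpus.xml",
]
ana_dir = [
    "ParlaMint-IS-listOrg.xml",
    "ParlaMint-IS-taxonomy-gov.ministries.ana.xml",
    "ParlaMint-IS-taxonomy-parla.categories.ana.xml",
    "ParlaMint-IS.ana.xml",
    "ParlaMint-taxonomy-NER.ana.xml",
    "ParlaMint-taxonomy-subcorpus.xml",
]

for plain, ana in ana_twins(tei_dir, ana_dir):
    print(f"{plain}  <->  {ana}")
```

Files that exist only in the annotated version, such as ParlaMint-taxonomy-NER.ana.xml, are not reported, because there is no TEI counterpart with the same base name.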
For the annotated version, you should extend the noAna parameter with the files seen in the TEI version:
https://github.com/clarin-eric/ParlaMint/blob/64229bb1a752f2b0073f103b7776dbece9527a7a/Scripts/parlamint2distro.pl#L309
The partial factorization is caused by your script (the variable $factorised = 1):
https://github.com/clarin-eric/ParlaMint/blob/64229bb1a752f2b0073f103b7776dbece9527a7a/Scripts/parlamint2distro.pl#L298-L311
I believe this is now solved with feba24b9c9e08c3bffedece408d9f0d9c6db55b6. It is also discussed in #675.