clarin-eric / ParlaMint

ParlaMint: Comparable Parliamentary Corpora
https://clarin-eric.github.io/ParlaMint/

Problem with parlamint-factorize-teiHeader.xsl? #636

Closed TomazErjavec closed 1 year ago

TomazErjavec commented 1 year ago

I have re-processed all the submitted corpora, and PL showed errors for missing taxonomies. I had a look, and it turns out parlamint-factorize-teiHeader.xsl did not factorise the teiHeader appropriately. I didn't manage to find the problem in the script itself, but I think I did identify the problem in the corpus: all the taxonomies were in place (i.e. in the TEI header), but listPerson and listOrg were factorised. I've now included these two lists in the teiHeader, so that it is completely "unfactorised", and now everything seems to work fine. So it seems parlamint-factorize-teiHeader.xsl does not work correctly if the teiHeader is partially factorised. I'm not sure myself whether this is a proper bug, because the situation is anomalous: nobody but PL had it. So, @mrudolf, be aware for any future submissions that the teiHeader should be either completely factorised or not factorised at all (unless @matyaskopp fixes his script, in which case it won't matter).

mrudolf commented 1 year ago

Can I see an example, because I am not sure I correctly understand what is wrong?

In reply to the message written by Tomaž Erjavec on 21.04.2023 at 21:16.

— Michał Rudolf

TomazErjavec commented 1 year ago

Can I see an example, because I am not sure I correctly understand what is wrong?

The partners submitted either a root file (ParlaMint-XX.xml / ParlaMint-AT.ana.xml) that has all the taxonomies, listPerson and listOrg included in its teiHeader, or a root file that XIncludes the taxonomies, listPerson and listOrg, e.g.

  -rw-r--r--   1 tomaz tomaz  40920 Dec 14 15:59 ParlaMint-AT-listOrg.xml
  -rw-r--r--   1 tomaz tomaz 881591 Feb 10 16:20 ParlaMint-AT-listPerson.xml
  -rw-r--r--   1 tomaz tomaz 173905 Dec 16 12:14 ParlaMint-AT.xml
  -rw-r--r--   1 tomaz tomaz   4152 Dec 14 15:59 ParlaMint-taxonomy-parla.legislature.xml
  -rw-r--r--   1 tomaz tomaz   1129 Dec 14 15:59 ParlaMint-taxonomy-speaker_types.xml
  -rw-r--r--   1 tomaz tomaz    775 Dec 14 15:59 ParlaMint-taxonomy-subcorpus.xml

You, however, have the taxonomies included in the teiHeader, but listPerson and listOrg XIncluded:

  -rw-r--r--   1 tomaz tomaz   3733 Mar 28 11:23 ParlaMint-PL-listOrg.xml
  -rw-r--r--   1 tomaz tomaz 500454 Mar 28 11:23 ParlaMint-PL-listPerson.xml
  -rw-rw-r--   1 tomaz tomaz  66424 Mar 28 18:25 ParlaMint-PL.xml
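
Which of these states a root file is in can be checked mechanically by looking inside its teiHeader for inline taxonomy, listPerson and listOrg elements on the one hand and XInclude elements on the other. The following is only a minimal sketch and not part of the ParlaMint scripts: the function name `header_state` and the sample document are made up for illustration, and it assumes that factorised components show up as `xi:include` elements somewhere inside the teiHeader.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"
XI = "{http://www.w3.org/2001/XInclude}"
COMPONENTS = ("taxonomy", "listPerson", "listOrg")

def header_state(root):
    """Classify the teiHeader as 'inline', 'factorised', 'partial' or 'empty'."""
    header = root.find(f"{TEI}teiHeader")
    if header is None:
        raise ValueError("no teiHeader found")
    # Any component element written directly into the header?
    has_inline = any(
        header.find(f".//{TEI}{name}") is not None for name in COMPONENTS
    )
    # Any XInclude inside the header (includes outside it are ignored)?
    has_include = header.find(f".//{XI}include") is not None
    if has_inline and has_include:
        return "partial"
    if has_inline:
        return "inline"
    if has_include:
        return "factorised"
    return "empty"

# Hypothetical, heavily shortened root file mimicking the PL situation.
SAMPLE = """<teiCorpus xmlns="http://www.tei-c.org/ns/1.0"
                       xmlns:xi="http://www.w3.org/2001/XInclude">
  <teiHeader>
    <encodingDesc><classDecl>
      <taxonomy xml:id="subcorpus"/>
    </classDecl></encodingDesc>
    <profileDesc><particDesc>
      <xi:include href="ParlaMint-PL-listOrg.xml"/>
      <xi:include href="ParlaMint-PL-listPerson.xml"/>
    </particDesc></profileDesc>
  </teiHeader>
</teiCorpus>"""

print(header_state(ET.fromstring(SAMPLE)))
```

A result of `partial` would flag the "semi factorised" case before the corpus is submitted.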

For the corpora that will be distributed, I do the factorisation, so that your corpus winds up like:

  -rw-rw-r--   1 tomaz tomaz   3860 Apr 22 01:53 ParlaMint-PL-listOrg.xml
  -rw-rw-r--   1 tomaz tomaz 524821 Apr 22 01:53 ParlaMint-PL-listPerson.xml
  -rw-rw-r--   1 tomaz tomaz  55067 Apr 22 01:53 ParlaMint-PL.xml
  -rw-rw-r--   1 tomaz tomaz   8091 Apr 22 01:53 ParlaMint-taxonomy-parla.legislature.xml
  -rw-rw-r--   1 tomaz tomaz    966 Apr 22 01:53 ParlaMint-taxonomy-speaker_types.xml
  -rw-rw-r--   1 tomaz tomaz    818 Apr 22 01:53 ParlaMint-taxonomy-subcorpus.xml

But the factorisation script does not seem to like that you have a "semi-factorised" root file. So what I did now is put listPerson and listOrg back into the teiHeader of the root file, so it is simply:

 -rw-rw-r--   1 tomaz tomaz 587393 Apr 21 20:27 ParlaMint-PL.ana.xml

and then the factorisation works fine, with the result that you can find at https://nl.ijs.si/et/tmp/ParlaMint/Repo/ParlaMint-PL.tgz and https://nl.ijs.si/et/tmp/ParlaMint/Repo/ParlaMint-PT.ana.tgz

So, your files are ok at my end, but maybe you should do the same for any further submissions, i.e. just put listPerson and listOrg inside the teiHeader.
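
Putting the XIncluded lists back into the teiHeader does not have to be done by hand. As a rough sketch only (this is not what the ParlaMint scripts do), Python's standard `xml.etree.ElementInclude` module can resolve the `xi:include` elements of the header in place. The helper name `inline_header` is hypothetical, and the sketch assumes that the included files sit next to the root file.

```python
import os
import xml.etree.ElementTree as ET
from xml.etree import ElementInclude

TEI = "{http://www.tei-c.org/ns/1.0}"

def inline_header(root_path):
    """Parse a root file and resolve the XIncludes inside its teiHeader."""
    base_dir = os.path.dirname(os.path.abspath(root_path))

    def loader(href, parse, encoding=None):
        # Resolve hrefs relative to the root file, not the working directory.
        path = os.path.join(base_dir, href)
        return ElementInclude.default_loader(path, parse, encoding)

    tree = ET.parse(root_path)
    header = tree.getroot().find(f"{TEI}teiHeader")
    if header is None:
        raise ValueError("no teiHeader found")
    # Only the header is expanded; XIncludes elsewhere in the root file
    # (e.g. of the component files) are left untouched.
    ElementInclude.include(header, loader=loader)
    return tree
```

Calling `inline_header("ParlaMint-PL.xml")` in the corpus folder would return a tree whose teiHeader contains listPerson and listOrg directly; serialising it back with the desired namespace prefixes is left out of this sketch.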

matyaskopp commented 1 year ago

@TomazErjavec, I have discovered a different issue with the way you use the script.

I am calling it this way (in the samples, with both TEI and TEI.ana in the same folder): https://github.com/clarin-eric/ParlaMint/blob/91ba6eb0c7e82240638c5b11e25c47237b8be619/Makefile#L307-L319 First, factorize the TEI version, then factorize TEI.ana and skip the files already seen in the TEI version.

Your script does not account for the IS situation (custom taxonomies in both the TEI and TEI.ana versions):

$ ls -l ParlaMint-IS.TEI*/ParlaMint-*
-rw-rw-r-- 1 tomaz tomaz   5683 dub 26 17:33 ParlaMint-IS.TEI.ana/ParlaMint-IS-listOrg.xml
-rw-rw-r-- 1 tomaz tomaz 198355 dub 26 17:33 ParlaMint-IS.TEI.ana/ParlaMint-IS-listPerson.xml
-rw-rw-r-- 1 tomaz tomaz   6800 dub 26 17:33 ParlaMint-IS.TEI.ana/ParlaMint-IS-taxonomy-gov.ministries.ana.xml
-rw-rw-r-- 1 tomaz tomaz   9120 dub 26 17:33 ParlaMint-IS.TEI.ana/ParlaMint-IS-taxonomy-parla.categories.ana.xml
-rw-rw-r-- 1 tomaz tomaz   5424 dub 26 17:33 ParlaMint-IS.TEI.ana/ParlaMint-IS-taxonomy-parla.constituencies.ana.xml
-rw-rw-r-- 1 tomaz tomaz 490205 dub 26 17:33 ParlaMint-IS.TEI.ana/ParlaMint-IS-taxonomy-parla.topics.ana.xml
-rw-rw-r-- 1 tomaz tomaz 109368 dub 26 17:33 ParlaMint-IS.TEI.ana/ParlaMint-IS.ana.xml
-rw-rw-r-- 1 tomaz tomaz    865 dub 26 17:33 ParlaMint-IS.TEI.ana/ParlaMint-taxonomy-NER.ana.xml
-rw-rw-r-- 1 tomaz tomaz   4672 dub 26 17:33 ParlaMint-IS.TEI.ana/ParlaMint-taxonomy-UD-SYN.ana.xml
-rw-rw-r-- 1 tomaz tomaz   5426 dub 26 17:33 ParlaMint-IS.TEI.ana/ParlaMint-taxonomy-parla.legislature.xml
-rw-rw-r-- 1 tomaz tomaz    941 dub 26 17:33 ParlaMint-IS.TEI.ana/ParlaMint-taxonomy-speaker_types.xml
-rw-rw-r-- 1 tomaz tomaz    818 dub 26 17:33 ParlaMint-IS.TEI.ana/ParlaMint-taxonomy-subcorpus.xml
-rw-rw-r-- 1 tomaz tomaz   5683 dub 26 18:15 ParlaMint-IS.TEI/ParlaMint-IS-listOrg.xml
-rw-rw-r-- 1 tomaz tomaz 198355 dub 26 18:15 ParlaMint-IS.TEI/ParlaMint-IS-listPerson.xml
-rw-rw-r-- 1 tomaz tomaz   6796 dub 26 18:15 ParlaMint-IS.TEI/ParlaMint-IS-taxonomy-gov.ministries.xml
-rw-rw-r-- 1 tomaz tomaz   9116 dub 26 18:15 ParlaMint-IS.TEI/ParlaMint-IS-taxonomy-parla.categories.xml
-rw-rw-r-- 1 tomaz tomaz   5420 dub 26 18:15 ParlaMint-IS.TEI/ParlaMint-IS-taxonomy-parla.constituencies.xml
-rw-rw-r-- 1 tomaz tomaz 490201 dub 26 18:15 ParlaMint-IS.TEI/ParlaMint-IS-taxonomy-parla.topics.xml
-rw-rw-r-- 1 tomaz tomaz 102776 dub 26 18:15 ParlaMint-IS.TEI/ParlaMint-IS.xml
-rw-rw-r-- 1 tomaz tomaz   5426 dub 26 18:15 ParlaMint-IS.TEI/ParlaMint-taxonomy-parla.legislature.xml
-rw-rw-r-- 1 tomaz tomaz    941 dub 26 18:15 ParlaMint-IS.TEI/ParlaMint-taxonomy-speaker_types.xml
-rw-rw-r-- 1 tomaz tomaz    818 dub 26 18:15 ParlaMint-IS.TEI/ParlaMint-taxonomy-subcorpus.xml

So you get two similar files with different names, e.g. (the sizes differ because /taxonomy/@xml:id differs):

-rw-rw-r-- 1 tomaz tomaz   9120 dub 26 17:33 ParlaMint-IS.TEI.ana/ParlaMint-IS-taxonomy-parla.categories.ana.xml
-rw-rw-r-- 1 tomaz tomaz   9116 dub 26 18:15 ParlaMint-IS.TEI/ParlaMint-IS-taxonomy-parla.categories.xml

For the annotated version, you should extend the parameter noAna with the files seen in the TEI version: https://github.com/clarin-eric/ParlaMint/blob/64229bb1a752f2b0073f103b7776dbece9527a7a/Scripts/parlamint2distro.pl#L309
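
Which files these are can be worked out from the two directory listings. The sketch below is an illustration only: the function name `redundant_ana_files` is hypothetical, it assumes the naming pattern visible in the listing above (a `.ana.xml` component in the TEI.ana folder mirroring an `.xml` component in the TEI folder), and it does not show how the result would be passed to the noAna parameter.

```python
def redundant_ana_files(tei_files, ana_files, root_name):
    """Pair each .ana component file with the TEI-version file it duplicates."""
    # Component files of the plain TEI version, excluding the root file.
    seen = {name for name in tei_files if name != f"{root_name}.xml"}
    redundant = []
    for name in sorted(ana_files):
        # Skip the annotated root file and files without the .ana suffix.
        if name == f"{root_name}.ana.xml" or not name.endswith(".ana.xml"):
            continue
        plain = name[: -len(".ana.xml")] + ".xml"
        if plain in seen:
            redundant.append((name, plain))
    return redundant

# File names taken from the IS listing (shortened).
tei = [
    "ParlaMint-IS-listOrg.xml",
    "ParlaMint-IS-taxonomy-gov.ministries.xml",
    "ParlaMint-IS-taxonomy-parla.categories.xml",
    "ParlaMint-IS.xml",
    "ParlaMint-taxonomy-subcorpus.xml",
]
ana = [
    "ParlaMint-IS-listOrg.xml",
    "ParlaMint-IS-taxonomy-gov.ministries.ana.xml",
    "ParlaMint-IS-taxonomy-parla.categories.ana.xml",
    "ParlaMint-IS.ana.xml",
    "ParlaMint-taxonomy-NER.ana.xml",
    "ParlaMint-taxonomy-subcorpus.xml",
]

for ana_name, tei_name in redundant_ana_files(tei, ana, "ParlaMint-IS"):
    print(f"{ana_name} duplicates {tei_name}")
```

Taxonomies that exist only in the annotated version, such as ParlaMint-taxonomy-NER.ana.xml, are not reported, because no plain counterpart was seen in the TEI version.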

The partial factorization is caused by your script (the variable $factorised = 1): https://github.com/clarin-eric/ParlaMint/blob/64229bb1a752f2b0073f103b7776dbece9527a7a/Scripts/parlamint2distro.pl#L298-L311

matyaskopp commented 1 year ago

I believe this is now solved with feba24b9c9e08c3bffedece408d9f0d9c6db55b6. It is also discussed in #675.