clarin-eric / ParlaMint

ParlaMint: Comparable Parliamentary Corpora
https://clarin-eric.github.io/ParlaMint/
50 stars 53 forks source link

Refactorise parlamint-add-common-content and parlamint2final #675

Closed TomazErjavec closed 1 year ago

TomazErjavec commented 1 year ago

We now have to somewhat similar scripts parlamint-add-common-content and parlamint2final, i.e. they both:

It is not good practice anyway to have simiar code in two places, but parlamint2final also has the bug (cf. https://github.com/clarin-eric/ParlaMint/issues/662#issuecomment-1550352079)) that it calculates tagUsages on the original data, but then changes various tags in the bodies of components, leading to incorrect final tagUsages.

I propose that we change this so that:

@matyaskopp, do you agree with this change? I can start implementing it, if so.

matyaskopp commented 1 year ago

My suggestions:

TomazErjavec commented 1 year ago

Nice ones; I'd call it parlamint4v3 :) And I just found #157, will look at it tomorow.

TomazErjavec commented 1 year ago

Here is my proposal for what parlamint-add-common-content should set or correct (both in root and component files, and taking into account number and date formatting, where applicable):

Anything else @matyaskopp? Any change of you implementing this, or shall I? I think parlamint2final does some of the above things better than add-common, so it might be worth comparint the templates.

TomazErjavec commented 1 year ago

Any change of you implementing this, or shall I?

As next week is pretty busy, and there is time pressure to finalise the distro scripts (it will take 3 weeks to reprocess all the corpora), I did this myself, with the final dev commit being https://github.com/clarin-eric/ParlaMint/commit/cc7addeed926e121e159135ac0e51a5bcc1df639.

I think add-common-content works fine (except for some doubts abut BE commission sessions), cf. https://github.com/clarin-eric/ParlaMint/blob/cc7addeed926e121e159135ac0e51a5bcc1df639/Scripts/parlamint-add-common-content.xsl#L687-L691

Also, parlamint2release might be ok too. The idea was also to leave the "fixing" of metadata (like removal GB "special" speakers) in add-common-content (even though it isn't really), but to move everything to do with annotated words into parlamint2release.xsl.

What doesn't work is factorisation, this might be my fault (and, to an extent probably is), but right now the parlamint2distro script has factorisation commented out: https://github.com/clarin-eric/ParlaMint/blob/cc7addeed926e121e159135ac0e51a5bcc1df639/Scripts/parlamint2distro.pl#L335-L337

The reason is, it produces a mess. Options:

Thoughts?

matyaskopp commented 1 year ago

What doesn't work is factorisation, this might be my fault (and, to an extent probably is), but right now the parlamint2distro script has factorisation commented out:

It never worked in a partially factorized corpus, because it checks whether any of the types listOrg, listPerson, taxonomy is factorized: https://github.com/clarin-eric/ParlaMint/blob/cc7addeed926e121e159135ac0e51a5bcc1df639/Scripts/parlamint2distro.pl#L324-L329 If any of these types is factorized, then factorization is skipped: https://github.com/clarin-eric/ParlaMint/blob/cc7addeed926e121e159135ac0e51a5bcc1df639/Scripts/parlamint2distro.pl#L331-L338

I believe the factorization script is working on partially factorized files, but it does not copy files in xi:include... Should I add it there?

TomazErjavec commented 1 year ago

because it checks whether any of the types listOrg, listPerson, taxonomy is factorized ... believe the factorization script is working on partially factorized files

So maybe the fix here would be to check if all the above are factorised?

but it does not copy files in xi:include... Should I add it there?

Yes please!

matyaskopp commented 1 year ago

because it checks whether any of the types listOrg, listPerson, taxonomy is factorized ... believe the factorization script is working on partially factorized files

So maybe the fix here would be to check if all the above are factorised?

I changed the distro script to factorize every time - even if it is already factorized. Not tested !!! https://github.com/clarin-eric/ParlaMint/blob/feba24b9c9e08c3bffedece408d9f0d9c6db55b6/Scripts/parlamint2distro.pl#L324-L342

but it does not copy files in xi:include... Should I add it there?

Yes please!

Done and tested (copy firstly parse xml - so indentation can change in the file)

TomazErjavec commented 1 year ago

Alas, after another 3 hours trying parlamint2distro.pl still doesn't work. I don't really understand the factorisation stuff it seems, and the whole distro with add-common, parlamint2release and factorisation has gotten so complicated that my head hurts.

The distro script now dies with

INFO: using (TEI+TEI.ana)-shared taxonomies from /home/project/corpora/Parla/ParlaMint/ParlaMint-V3/Distro/Test/In/ParlaMint-LV.TEI
INFO: Factorising /home/project/corpora/Parla/ParlaMint/ParlaMint-V3/Distro/Test/Out/ParlaMint-LV.TEI.ana/ParlaMint-LV.ana.xml
INFO: Starting to process ParlaMint-LV.ana
INFO: processing root
INFO: Copying ParlaMint-taxonomy-parla.legislature.xml to /home/project/corpora/Parla/ParlaMint/ParlaMint-V3/Scripts/tmp/eZRJEbv_Cx/factorise/Pa\
rlaMint-taxonomy-parla.legislature.xml
Error at char 14 in expression in xsl:apply-templates/@select on line 127 column 70 of parlamint-factorize-teiHeader.xsl:
  FODC0002  I/O error reported by XML parser processing
  file:/home/project/corpora/Parla/ParlaMint/ParlaMint-V3/Distro/Test/Out/ParlaMint-LV.TEI.ana/ParlaMint-taxonomy-parla.legislature.xml: /home/project/corpora/Parla/ParlaMint/ParlaMint-V3/Distro/Test/Out/ParlaMint-LV.TEI.ana/ParlaMint-taxonomy-parla.legislature.xml (No such file or directory). Caused by java.io.FileNotFoundException: /home/project/corpora/Parla/ParlaMint/ParlaMint-V3/Distro/Test/Out/ParlaMint-LV.TEI.ana/ParlaMint-taxonomy-parla.legislature.xml (No such file or directory)

I would be of coruse grateful for any help, can be directly on tantra of course.

TomazErjavec commented 1 year ago

This has all been now fixed, closing.