Closed TomazErjavec closed 1 year ago
My suggestions:
parlamint-fix-known-bugs
to avoid temptations for adding features that should be in a different scriptparlamint-add-common-content
can be extended with
Nice ones; I'd call it parlamint4v3 :) And I just found #157, will look at it tomorow.
Here is my proposal for what parlamint-add-common-content should set or correct (both in root and component files, and taking into account number and date formatting, where applicable):
Anything else @matyaskopp? Any change of you implementing this, or shall I? I think parlamint2final does some of the above things better than add-common, so it might be worth comparint the templates.
Any change of you implementing this, or shall I?
As next week is pretty busy, and there is time pressure to finalise the distro scripts (it will take 3 weeks to reprocess all the corpora), I did this myself, with the final dev commit being https://github.com/clarin-eric/ParlaMint/commit/cc7addeed926e121e159135ac0e51a5bcc1df639.
I think add-common-content works fine (except for some doubts abut BE commission sessions), cf. https://github.com/clarin-eric/ParlaMint/blob/cc7addeed926e121e159135ac0e51a5bcc1df639/Scripts/parlamint-add-common-content.xsl#L687-L691
Also, parlamint2release might be ok too. The idea was also to leave the "fixing" of metadata (like removal GB "special" speakers) in add-common-content (even though it isn't really), but to move everything to do with annotated words into parlamint2release.xsl.
What doesn't work is factorisation, this might be my fault (and, to an extent probably is), but right now the parlamint2distro script has factorisation commented out: https://github.com/clarin-eric/ParlaMint/blob/cc7addeed926e121e159135ac0e51a5bcc1df639/Scripts/parlamint2distro.pl#L335-L337
The reason is, it produces a mess. Options:
Thoughts?
What doesn't work is factorisation, this might be my fault (and, to an extent probably is), but right now the parlamint2distro script has factorisation commented out:
It never worked in a partially factorized corpus, because it checks whether any of the types listOrg, listPerson, taxonomy is factorized: https://github.com/clarin-eric/ParlaMint/blob/cc7addeed926e121e159135ac0e51a5bcc1df639/Scripts/parlamint2distro.pl#L324-L329 If any of these types is factorized, then factorization is skipped: https://github.com/clarin-eric/ParlaMint/blob/cc7addeed926e121e159135ac0e51a5bcc1df639/Scripts/parlamint2distro.pl#L331-L338
I believe the factorization script is working on partially factorized files, but it does not copy files in xi:include
...
Should I add it there?
because it checks whether any of the types listOrg, listPerson, taxonomy is factorized ... believe the factorization script is working on partially factorized files
So maybe the fix here would be to check if all the above are factorised?
but it does not copy files in xi:include... Should I add it there?
Yes please!
because it checks whether any of the types listOrg, listPerson, taxonomy is factorized ... believe the factorization script is working on partially factorized files
So maybe the fix here would be to check if all the above are factorised?
I changed the distro script to factorize every time - even if it is already factorized. Not tested !!! https://github.com/clarin-eric/ParlaMint/blob/feba24b9c9e08c3bffedece408d9f0d9c6db55b6/Scripts/parlamint2distro.pl#L324-L342
but it does not copy files in xi:include... Should I add it there?
Yes please!
Done and tested (copy firstly parse xml - so indentation can change in the file)
Alas, after another 3 hours trying parlamint2distro.pl still doesn't work. I don't really understand the factorisation stuff it seems, and the whole distro with add-common, parlamint2release and factorisation has gotten so complicated that my head hurts.
The distro script now dies with
INFO: using (TEI+TEI.ana)-shared taxonomies from /home/project/corpora/Parla/ParlaMint/ParlaMint-V3/Distro/Test/In/ParlaMint-LV.TEI
INFO: Factorising /home/project/corpora/Parla/ParlaMint/ParlaMint-V3/Distro/Test/Out/ParlaMint-LV.TEI.ana/ParlaMint-LV.ana.xml
INFO: Starting to process ParlaMint-LV.ana
INFO: processing root
INFO: Copying ParlaMint-taxonomy-parla.legislature.xml to /home/project/corpora/Parla/ParlaMint/ParlaMint-V3/Scripts/tmp/eZRJEbv_Cx/factorise/Pa\
rlaMint-taxonomy-parla.legislature.xml
Error at char 14 in expression in xsl:apply-templates/@select on line 127 column 70 of parlamint-factorize-teiHeader.xsl:
FODC0002 I/O error reported by XML parser processing
file:/home/project/corpora/Parla/ParlaMint/ParlaMint-V3/Distro/Test/Out/ParlaMint-LV.TEI.ana/ParlaMint-taxonomy-parla.legislature.xml: /home/project/corpora/Parla/ParlaMint/ParlaMint-V3/Distro/Test/Out/ParlaMint-LV.TEI.ana/ParlaMint-taxonomy-parla.legislature.xml (No such file or directory). Caused by java.io.FileNotFoundException: /home/project/corpora/Parla/ParlaMint/ParlaMint-V3/Distro/Test/Out/ParlaMint-LV.TEI.ana/ParlaMint-taxonomy-parla.legislature.xml (No such file or directory)
I would be of coruse grateful for any help, can be directly on tantra of course.
This has all been now fixed, closing.
We now have to somewhat similar scripts parlamint-add-common-content and parlamint2final, i.e. they both:
It is not good practice anyway to have simiar code in two places, but parlamint2final also has the bug (cf. https://github.com/clarin-eric/ParlaMint/issues/662#issuecomment-1550352079)) that it calculates tagUsages on the original data, but then changes various tags in the bodies of components, leading to incorrect final tagUsages.
I propose that we change this so that:
@matyaskopp, do you agree with this change? I can start implementing it, if so.