clarin-eric / ParlaMint

ParlaMint: Comparable Parliamentary Corpora
https://clarin-eric.github.io/ParlaMint/

improve factorize script #524

Closed matyaskopp closed 1 year ago

matyaskopp commented 1 year ago

What I am doing in factorize-teiHeader

  • factorize TEI version
  • get list of included files in teiHeader of the root file of TEI version
  • factorize TEI.ana version, passing it the list of files included in the TEI version (the files in this list are not newly created). So the solution is to pass the list and copy these files from the TEI version.
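
The second step above (collecting the files included in the teiHeader of the root file) can be sketched roughly as follows. This is only an illustration in Python, not how the XSLT script does it: it assumes the root file references its factorized parts with `xi:include` elements, and the function name and the sample document (including the component file name) are made up for the example.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XI_NS = "http://www.w3.org/2001/XInclude"


def included_header_files(root_xml):
    """Return the href of every xi:include found inside a teiHeader.

    Hypothetical helper: includes outside teiHeader (the component
    files of the corpus) are deliberately ignored.
    """
    root = ET.fromstring(root_xml)
    hrefs = []
    for header in root.iter(f"{{{TEI_NS}}}teiHeader"):
        for inc in header.iter(f"{{{XI_NS}}}include"):
            href = inc.get("href")
            if href:
                hrefs.append(href)
    return hrefs


# Made-up miniature root file, only for illustration.
SAMPLE = """<teiCorpus xmlns="http://www.tei-c.org/ns/1.0"
    xmlns:xi="http://www.w3.org/2001/XInclude" xml:id="ParlaMint-GR">
  <teiHeader>
    <encodingDesc><classDecl>
      <xi:include href="ParlaMint-taxonomy-subcorpus.xml"/>
    </classDecl></encodingDesc>
    <profileDesc><particDesc>
      <xi:include href="ParlaMint-GR-listOrg.xml"/>
      <xi:include href="ParlaMint-GR-listPerson.xml"/>
    </particDesc></profileDesc>
  </teiHeader>
  <xi:include href="ParlaMint-GR_2015-01-01.xml"/>
</teiCorpus>"""

if __name__ == "__main__":
    for name in included_header_files(SAMPLE):
        print(name)
```

The resulting list is what would be handed to the .ana run so that the shared files are not created a second time.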

This is rather horrible, as it is contrary to how I do things otherwise, i.e. I first do .ana and then .TEI, as I need to insert the number of words in .ana into .TEI, and here I would have to do it the other way around, rather a mess...

Would it be possible for you to change the script so that the "skip" files are not actually skipped, but that you generate them as usual, except that you give them names as they are in the skip list? Or does that destroy some of your assumptions?

TODO

sample run:

java -jar /usr/share/java/saxon.jar outDir=Data/ParlaMint-GR/factorize-teiHeader \
>    prefix="ParlaMint-GR-" \
>    noAna="ParlaMint-taxonomy-parla.legislature.xml ParlaMint-taxonomy-speaker_types.xml ParlaMint-taxonomy-subcorpus.xml ParlaMint-GR-listOrg.xml ParlaMint-GR-listPerson.xml" \
>    -xsl:Scripts/parlamint-factorize-teiHeader.xsl \
>    Data/ParlaMint-GR/ParlaMint-GR.ana.xml
INFO: Starting to process ParlaMint-GR.ana
INFO: processing root 
Saving taxonomy to Data/ParlaMint-GR/factorize-teiHeader/ParlaMint-taxonomy-parla.legislature.xml
INFO: replacing xml:id parla.legislature with ParlaMint-taxonomy-parla.legislature
Saving taxonomy to Data/ParlaMint-GR/factorize-teiHeader/ParlaMint-taxonomy-speaker_types.xml
INFO: replacing xml:id speaker_types with ParlaMint-taxonomy-speaker_types
Saving taxonomy to Data/ParlaMint-GR/factorize-teiHeader/ParlaMint-taxonomy-subcorpus.xml
INFO: replacing xml:id subcorpus with ParlaMint-taxonomy-subcorpus
Saving taxonomy to Data/ParlaMint-GR/factorize-teiHeader/ParlaMint-taxonomy-NER.ana.xml
INFO: replacing xml:id NER with ParlaMint-taxonomy-NER.ana
Saving taxonomy to Data/ParlaMint-GR/factorize-teiHeader/ParlaMint-taxonomy-UD-SYN.ana.xml
INFO: replacing xml:id UD-SYN with ParlaMint-taxonomy-UD-SYN.ana
Saving listOrg to Data/ParlaMint-GR/factorize-teiHeader/ParlaMint-GR-listOrg.xml
INFO: replacing xml:id  with ParlaMint-GR-listOrg
Saving listPerson to Data/ParlaMint-GR/factorize-teiHeader/ParlaMint-GR-listPerson.xml
INFO: replacing xml:id  with ParlaMint-GR-listPerson

and the output:

find Data/ParlaMint-GR/factorize-teiHeader
Data/ParlaMint-GR/factorize-teiHeader
Data/ParlaMint-GR/factorize-teiHeader/ParlaMint-taxonomy-speaker_types.xml
Data/ParlaMint-GR/factorize-teiHeader/ParlaMint-taxonomy-UD-SYN.ana.xml
Data/ParlaMint-GR/factorize-teiHeader/ParlaMint-GR-listPerson.xml
Data/ParlaMint-GR/factorize-teiHeader/ParlaMint-taxonomy-parla.legislature.xml
Data/ParlaMint-GR/factorize-teiHeader/ParlaMint-GR-listOrg.xml
Data/ParlaMint-GR/factorize-teiHeader/ParlaMint-taxonomy-subcorpus.xml
Data/ParlaMint-GR/factorize-teiHeader/ParlaMint-GR.ana.xml
Data/ParlaMint-GR/factorize-teiHeader/ParlaMint-taxonomy-NER.ana.xml

matyaskopp commented 1 year ago

@TomazErjavec is it better this way? It does not break my script, and you can use the new parameter `noAna="..."` in your finalization script.
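
Judging only from the sample log above, the naming rule behind `noAna` seems to be: when factorizing the .ana version, a file gets the `.ana.xml` suffix unless its plain name is in the `noAna` list. A minimal Python sketch of that reading follows; the function and variable names are hypothetical, and the actual XSLT logic may differ.

```python
def output_name(base, is_ana, no_ana):
    """Pick the file name for a factorized header part.

    Assumed rule (inferred from the log, not from the XSLT source):
    in an .ana run the file is named '<base>.ana.xml', unless
    '<base>.xml' is listed in no_ana, in which case the plain name
    is kept.
    """
    plain = f"{base}.xml"
    if is_ana and plain not in no_ana:
        return f"{base}.ana.xml"
    return plain


# noAna is passed on the command line as one space-separated string.
NO_ANA = set(
    "ParlaMint-taxonomy-parla.legislature.xml "
    "ParlaMint-taxonomy-speaker_types.xml "
    "ParlaMint-taxonomy-subcorpus.xml "
    "ParlaMint-GR-listOrg.xml ParlaMint-GR-listPerson.xml".split()
)

if __name__ == "__main__":
    for base in ("ParlaMint-taxonomy-subcorpus", "ParlaMint-taxonomy-NER"):
        print(base, "->", output_name(base, True, NO_ANA))
```

With the list from the sample run, this reproduces the file names shown in the `find` output: the three listed taxonomies and the two lists keep their plain names, while NER and UD-SYN get `.ana.xml`.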

TomazErjavec commented 1 year ago

Yes, this is definitely much better, thanks. A few minor complaints:

But none of these is a deal breaker, I can also survive without these tweaks, just let me know.

matyaskopp commented 1 year ago

I believe that your request is now implemented:

java -jar /usr/share/java/saxon.jar \
>                 outDir=Data/ParlaMint-GR/factorize-teiHeader \
>                 noAna="ParlaMint-taxonomy-parla.legislature.xml ParlaMint-taxonomy-speaker_types.xml ParlaMint-taxonomy-subcorpus.xml ParlaMint-listOrg.xml ParlaMint-listPerson.xml" \
>                 -xsl:Scripts/parlamint-factorize-teiHeader.xsl \
>                 Data/ParlaMint-GR/ParlaMint-GR.ana.xml
INFO: Starting to process ParlaMint-GR.ana
INFO: processing root 
INFO: Saving taxonomy to Data/ParlaMint-GR/factorize-teiHeader/ParlaMint-taxonomy-parla.legislature.xml
INFO: replacing xml:id parla.legislature with ParlaMint-taxonomy-parla.legislature
INFO: Saving taxonomy to Data/ParlaMint-GR/factorize-teiHeader/ParlaMint-taxonomy-speaker_types.xml
INFO: replacing xml:id speaker_types with ParlaMint-taxonomy-speaker_types
INFO: Saving taxonomy to Data/ParlaMint-GR/factorize-teiHeader/ParlaMint-taxonomy-subcorpus.xml
INFO: replacing xml:id subcorpus with ParlaMint-taxonomy-subcorpus
INFO: Saving taxonomy to Data/ParlaMint-GR/factorize-teiHeader/ParlaMint-taxonomy-NER.ana.xml
INFO: replacing xml:id NER with ParlaMint-taxonomy-NER.ana
INFO: Saving taxonomy to Data/ParlaMint-GR/factorize-teiHeader/ParlaMint-taxonomy-UD-SYN.ana.xml
INFO: replacing xml:id UD-SYN with ParlaMint-taxonomy-UD-SYN.ana
INFO: Saving listOrg to Data/ParlaMint-GR/factorize-teiHeader/ParlaMint-GR-listOrg.xml
INFO: replacing xml:id  with ParlaMint-GR-listOrg
INFO: Saving listPerson to Data/ParlaMint-GR/factorize-teiHeader/ParlaMint-GR-listPerson.xml
INFO: replacing xml:id  with ParlaMint-GR-listPerson

If `prefix` is not defined, it is derived from `/teiCorpus/@xml:id`.
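
As an illustration of that fallback, here is a small Python sketch. The exact derivation is an assumption based on the logs in this thread (root id `ParlaMint-GR.ana`, resulting files named `ParlaMint-GR-listOrg.xml`): drop a trailing `.ana` and append `-`. The function name is hypothetical and the XSLT may do this differently.

```python
import xml.etree.ElementTree as ET

# Clark-notation name of the xml:id attribute.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def effective_prefix(root_xml, prefix=None):
    """Return the explicit prefix, or derive one from the root xml:id.

    Assumed derivation: strip a trailing '.ana' from the id and
    append '-'.
    """
    if prefix:
        return prefix
    corpus_id = ET.fromstring(root_xml).get(XML_ID)
    if corpus_id is None:
        raise ValueError("root element has no xml:id to derive a prefix from")
    if corpus_id.endswith(".ana"):
        corpus_id = corpus_id[: -len(".ana")]
    return corpus_id + "-"


ROOT = '<teiCorpus xmlns="http://www.tei-c.org/ns/1.0" xml:id="ParlaMint-GR.ana"/>'

if __name__ == "__main__":
    print(effective_prefix(ROOT))
    print(effective_prefix(ROOT, prefix="ParlaMint-GR-"))
```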

TomazErjavec commented 1 year ago

Great, just what I wanted and seems to work just fine!