improve factorize script

matyaskopp commented 1 year ago

What I am doing in factorize-teiHeader

factorize TEI version

get list of included files in teiHeader of the root file of TEI version

factorize TEI.ana version, passing the list of files included in the TEI version (the files in this list are not newly created). So the solution is to pass the list and copy these files from the TEI version.

This is rather horrible, as it is contrary to how I do things otherwise, i.e. I first do .ana and then .TEI, as I need to insert the number of words in .ana into .TEI, and here I would have to do it the other way around, rather a mess...

Would it be possible for you to change the script so that the "skip" files are not actually skipped, but that you generate them as usual, except that you give them names as they are in the skip list? Or does that destroy some of your assumptions?
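The second step above (getting the list of files included in the TEI root) could be sketched roughly as follows. This is only an illustration: it assumes the factorized root refers to its parts with XInclude elements written as `<xi:include ... href="..."/>`, one per line, and the helper name `list_included_files` and the sample root content are made up.

```shell
# Sketch: list the files that a factorized root file pulls in.
# Assumption: each include is an <xi:include ... href="..."/> on a single line.
list_included_files() {
  grep -o '<xi:include[^>]*href="[^"]*"' "$1" |
    sed 's/.*href="\([^"]*\)".*/\1/'
}

# Tiny stand-in for a factorized TEI root (made-up content, not real corpus data).
root=$(mktemp)
cat > "$root" <<'EOF'
<teiCorpus xmlns:xi="http://www.w3.org/2001/XInclude">
  <xi:include href="ParlaMint-taxonomy-subcorpus.xml"/>
  <xi:include href="ParlaMint-GR-listOrg.xml"/>
</teiCorpus>
EOF

# Join the names with single spaces so they can be passed on as one list.
noAna=$(list_included_files "$root" | paste -sd ' ' -)
echo "$noAna"
rm -f "$root"
```

A regex over the raw file is fragile (it breaks if an include spans several lines); a real script would more likely read the includes with an XML-aware tool.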

TODO

add a new parameter noAna to the script: a list of taxonomies/files for which the ana interfix will not be included in the file name (because the file was already seen in the TEI version)
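The naming rule this describes can be sketched in a few lines of shell. This is not the XSLT from the script, just an illustration of the intended behaviour; the function name `factorized_name` is hypothetical.

```shell
# Sketch of the naming rule: a file keeps its plain name if it is listed
# in noAna, otherwise it gets the ".ana" interfix before ".xml".
# $1 = file name without interfix, $2 = space-separated noAna list
factorized_name() {
  _name=$1
  for _f in $2; do
    if [ "$_f" = "$_name" ]; then
      printf '%s\n' "$_name"      # listed in noAna: no interfix
      return 0
    fi
  done
  printf '%s\n' "${_name%.xml}.ana.xml"   # not listed: add interfix
}

noAna="ParlaMint-taxonomy-subcorpus.xml ParlaMint-GR-listOrg.xml"
factorized_name "ParlaMint-taxonomy-subcorpus.xml" "$noAna"
factorized_name "ParlaMint-taxonomy-NER.xml" "$noAna"
```

The first call prints the name unchanged and the second prints `ParlaMint-taxonomy-NER.ana.xml`.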

sample run:

java -jar /usr/share/java/saxon.jar outDir=Data/ParlaMint-GR/factorize-teiHeader \
>    prefix="ParlaMint-GR-" \
>    noAna="ParlaMint-taxonomy-parla.legislature.xml ParlaMint-taxonomy-speaker_types.xml ParlaMint-taxonomy-subcorpus.xml ParlaMint-GR-listOrg.xml ParlaMint-GR-listPerson.xml" \
>    -xsl:Scripts/parlamint-factorize-teiHeader.xsl \
>    Data/ParlaMint-GR/ParlaMint-GR.ana.xml
INFO: Starting to process ParlaMint-GR.ana
INFO: processing root 
Saving taxonomy to Data/ParlaMint-GR/factorize-teiHeader/ParlaMint-taxonomy-parla.legislature.xml
INFO: replacing xml:id parla.legislature with ParlaMint-taxonomy-parla.legislature
Saving taxonomy to Data/ParlaMint-GR/factorize-teiHeader/ParlaMint-taxonomy-speaker_types.xml
INFO: replacing xml:id speaker_types with ParlaMint-taxonomy-speaker_types
Saving taxonomy to Data/ParlaMint-GR/factorize-teiHeader/ParlaMint-taxonomy-subcorpus.xml
INFO: replacing xml:id subcorpus with ParlaMint-taxonomy-subcorpus
Saving taxonomy to Data/ParlaMint-GR/factorize-teiHeader/ParlaMint-taxonomy-NER.ana.xml
INFO: replacing xml:id NER with ParlaMint-taxonomy-NER.ana
Saving taxonomy to Data/ParlaMint-GR/factorize-teiHeader/ParlaMint-taxonomy-UD-SYN.ana.xml
INFO: replacing xml:id UD-SYN with ParlaMint-taxonomy-UD-SYN.ana
Saving listOrg to Data/ParlaMint-GR/factorize-teiHeader/ParlaMint-GR-listOrg.xml
INFO: replacing xml:id  with ParlaMint-GR-listOrg
Saving listPerson to Data/ParlaMint-GR/factorize-teiHeader/ParlaMint-GR-listPerson.xml
INFO: replacing xml:id  with ParlaMint-GR-listPerson

and the output:

find Data/ParlaMint-GR/factorize-teiHeader
Data/ParlaMint-GR/factorize-teiHeader
Data/ParlaMint-GR/factorize-teiHeader/ParlaMint-taxonomy-speaker_types.xml
Data/ParlaMint-GR/factorize-teiHeader/ParlaMint-taxonomy-UD-SYN.ana.xml
Data/ParlaMint-GR/factorize-teiHeader/ParlaMint-GR-listPerson.xml
Data/ParlaMint-GR/factorize-teiHeader/ParlaMint-taxonomy-parla.legislature.xml
Data/ParlaMint-GR/factorize-teiHeader/ParlaMint-GR-listOrg.xml
Data/ParlaMint-GR/factorize-teiHeader/ParlaMint-taxonomy-subcorpus.xml
Data/ParlaMint-GR/factorize-teiHeader/ParlaMint-GR.ana.xml
Data/ParlaMint-GR/factorize-teiHeader/ParlaMint-taxonomy-NER.ana.xml

matyaskopp commented 1 year ago

@TomazErjavec Is it better this way? It does not break my script, and you can use the new parameter `noAna="..."` in your finalization script.

TomazErjavec commented 1 year ago

Yes, this is definitely much better, thanks. A few minor complaints:

you have INFO with most messages, but not all, e.g. "Saving taxonomy to Data/ParlaMint-GR/factorize-teiHeader/ParlaMint-taxonomy-parla.legislature.xml". It would be aesthetically pleasing (and potentially useful, i.e. "grep -v INFO") to have INFO everywhere.
the way it is now, my script has to be aware that it is processing e.g. the GR corpus, in order to add the -GR suffix to the listPerson and listOrg. Ideally, I would write just "ParlaMint-listOrg.xml" etc. in the noAna parameter (i.e. have a constant list of factorisable files), and your script would be smart enough to know that it should add the -GR suffix in the filename and its reference.

But none of these is a deal breaker, I can also survive without these tweaks, just let me know.
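To illustrate the "grep -v INFO" point, here is a small sketch using two lines copied from the sample run above. With a real run the filter would be applied to the script's messages; whether those arrive on stderr and need `2>&1` is an assumption that is not confirmed in this thread.

```shell
# One line without and one line with the INFO prefix, copied from the sample run.
log='Saving taxonomy to Data/ParlaMint-GR/factorize-teiHeader/ParlaMint-taxonomy-subcorpus.xml
INFO: replacing xml:id subcorpus with ParlaMint-taxonomy-subcorpus'

# Keep only the lines that do not start with "INFO:".
# "|| true" because grep exits non-zero when nothing is left over,
# which is the good case once every message carries the prefix.
non_info=$(printf '%s\n' "$log" | grep -v '^INFO:' || true)
printf '%s\n' "$non_info"
```

As long as some messages lack the prefix, they show up in the filtered output; once INFO is everywhere, anything left after filtering is worth looking at.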

matyaskopp commented 1 year ago

I believe that your request is now implemented:

java -jar /usr/share/java/saxon.jar \
>                 outDir=Data/ParlaMint-GR/factorize-teiHeader \
>                 noAna="ParlaMint-taxonomy-parla.legislature.xml ParlaMint-taxonomy-speaker_types.xml ParlaMint-taxonomy-subcorpus.xml ParlaMint-listOrg.xml ParlaMint-listPerson.xml" \
>                 -xsl:Scripts/parlamint-factorize-teiHeader.xsl \
>                 Data/ParlaMint-GR/ParlaMint-GR.ana.xml
INFO: Starting to process ParlaMint-GR.ana
INFO: processing root 
INFO: Saving taxonomy to Data/ParlaMint-GR/factorize-teiHeader/ParlaMint-taxonomy-parla.legislature.xml
INFO: replacing xml:id parla.legislature with ParlaMint-taxonomy-parla.legislature
INFO: Saving taxonomy to Data/ParlaMint-GR/factorize-teiHeader/ParlaMint-taxonomy-speaker_types.xml
INFO: replacing xml:id speaker_types with ParlaMint-taxonomy-speaker_types
INFO: Saving taxonomy to Data/ParlaMint-GR/factorize-teiHeader/ParlaMint-taxonomy-subcorpus.xml
INFO: replacing xml:id subcorpus with ParlaMint-taxonomy-subcorpus
INFO: Saving taxonomy to Data/ParlaMint-GR/factorize-teiHeader/ParlaMint-taxonomy-NER.ana.xml
INFO: replacing xml:id NER with ParlaMint-taxonomy-NER.ana
INFO: Saving taxonomy to Data/ParlaMint-GR/factorize-teiHeader/ParlaMint-taxonomy-UD-SYN.ana.xml
INFO: replacing xml:id UD-SYN with ParlaMint-taxonomy-UD-SYN.ana
INFO: Saving listOrg to Data/ParlaMint-GR/factorize-teiHeader/ParlaMint-GR-listOrg.xml
INFO: replacing xml:id  with ParlaMint-GR-listOrg
INFO: Saving listPerson to Data/ParlaMint-GR/factorize-teiHeader/ParlaMint-GR-listPerson.xml
INFO: replacing xml:id  with ParlaMint-GR-listPerson

If prefix is not defined, then it is derived from /teiCorpus/@xml:id
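A rough sketch of how such a derivation and the expansion of the generic names might work. The exact rules are assumptions read off the log above (the id `ParlaMint-GR.ana` giving the prefix `ParlaMint-GR-`, and only the list files receiving it); the function names are hypothetical and the real XSLT may do this differently.

```shell
# Assumption: the corpus xml:id looks like "ParlaMint-GR.ana" or "ParlaMint-GR";
# the prefix is the id without ".ana", followed by "-".
derive_prefix() {
  printf '%s-\n' "${1%.ana}"
}

# Assumption: only generic list names (ParlaMint-list*.xml) get the corpus
# prefix; taxonomy names are shared between corpora and stay as they are.
# $1 = prefix, $2 = entry from the noAna list
expand_entry() {
  case $2 in
    ParlaMint-list*) printf '%s%s\n' "$1" "${2#ParlaMint-}" ;;
    *)               printf '%s\n' "$2" ;;
  esac
}

prefix=$(derive_prefix "ParlaMint-GR.ana")
expand_entry "$prefix" "ParlaMint-listOrg.xml"
expand_entry "$prefix" "ParlaMint-taxonomy-subcorpus.xml"
```

The first call prints `ParlaMint-GR-listOrg.xml`, matching the file name in the log; the taxonomy name is printed unchanged.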

TomazErjavec commented 1 year ago

Great, just what I wanted and seems to work just fine!

clarin-eric / ParlaMint

improve factorize script #524