clarin-eric / ParlaMint

ParlaMint: Comparable Parliamentary Corpora
https://clarin-eric.github.io/ParlaMint/

factorization before finalization distribution #681

Closed matyaskopp closed 1 year ago

matyaskopp commented 1 year ago

Prepare a script that factorizes the data before it is finalized with the distro script.

matyaskopp commented 1 year ago

@TomazErjavec

I have added the target test-factorize to the Makefile: https://github.com/clarin-eric/ParlaMint/blob/1424c3e21f1da972587fae3bfdb74499ba14ab39/Distro/Makefile#L51-L62

Running

make test-factorize CORPUS=BA
make test-factorize CORPUS=LV

you get this result:

du -h Test/Factorized/*/*
12K     Test/Factorized/ParlaMint-BA.TEI.ana/ParlaMint-BA.ana.xml
16K     Test/Factorized/ParlaMint-BA.TEI.ana/ParlaMint-BA-listOrg.ana.xml
288K    Test/Factorized/ParlaMint-BA.TEI.ana/ParlaMint-BA-listPerson.ana.xml
4,0K    Test/Factorized/ParlaMint-BA.TEI.ana/ParlaMint-taxonomy-NER.ana.xml
12K     Test/Factorized/ParlaMint-BA.TEI.ana/ParlaMint-taxonomy-parla.legislature.xml
4,0K    Test/Factorized/ParlaMint-BA.TEI.ana/ParlaMint-taxonomy-speaker_types.xml
4,0K    Test/Factorized/ParlaMint-BA.TEI.ana/ParlaMint-taxonomy-subcorpus.xml
8,0K    Test/Factorized/ParlaMint-BA.TEI.ana/ParlaMint-taxonomy-UD-SYN.ana.xml
16K     Test/Factorized/ParlaMint-BA.TEI/ParlaMint-BA-listOrg.xml
288K    Test/Factorized/ParlaMint-BA.TEI/ParlaMint-BA-listPerson.xml
12K     Test/Factorized/ParlaMint-BA.TEI/ParlaMint-BA.xml
12K     Test/Factorized/ParlaMint-BA.TEI/ParlaMint-taxonomy-parla.legislature.xml
4,0K    Test/Factorized/ParlaMint-BA.TEI/ParlaMint-taxonomy-speaker_types.xml
4,0K    Test/Factorized/ParlaMint-BA.TEI/ParlaMint-taxonomy-subcorpus.xml
8,0K    Test/Factorized/ParlaMint-LV.TEI.ana/ParlaMint-LV.ana.xml
8,0K    Test/Factorized/ParlaMint-LV.TEI.ana/ParlaMint-LV-listOrg.xml
144K    Test/Factorized/ParlaMint-LV.TEI.ana/ParlaMint-LV-listPerson.xml
4,0K    Test/Factorized/ParlaMint-LV.TEI.ana/ParlaMint-taxonomy-NER.ana.xml
8,0K    Test/Factorized/ParlaMint-LV.TEI.ana/ParlaMint-taxonomy-parla.legislature.xml
4,0K    Test/Factorized/ParlaMint-LV.TEI.ana/ParlaMint-taxonomy-speaker_types.xml
4,0K    Test/Factorized/ParlaMint-LV.TEI.ana/ParlaMint-taxonomy-subcorpus.xml
72K     Test/Factorized/ParlaMint-LV.TEI.ana/ParlaMint-taxonomy-UD-SYN.ana.xml
8,0K    Test/Factorized/ParlaMint-LV.TEI/ParlaMint-LV-listOrg.xml
144K    Test/Factorized/ParlaMint-LV.TEI/ParlaMint-LV-listPerson.xml
8,0K    Test/Factorized/ParlaMint-LV.TEI/ParlaMint-LV.xml
8,0K    Test/Factorized/ParlaMint-LV.TEI/ParlaMint-taxonomy-parla.legislature.xml
4,0K    Test/Factorized/ParlaMint-LV.TEI/ParlaMint-taxonomy-speaker_types.xml
4,0K    Test/Factorized/ParlaMint-LV.TEI/ParlaMint-taxonomy-subcorpus.xml

The component files are not copied. I am not sure how to handle this: I don't want to copy all the files if only the root files are changed, but the distro script expects everything in one folder.
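One possible way around the copy-or-not question, sketched here only as an illustration and not as what the ParlaMint scripts actually do, is to symlink the unchanged component files into the factorized folder so that everything appears in one place without duplicating data. The function name `link_missing_components` and the directory layout (component files as `*.xml`, possibly in subfolders) are assumptions made for this sketch.

```python
import os
from pathlib import Path


def link_missing_components(source_dir, factorized_dir):
    """Symlink every *.xml file that exists in source_dir but not in factorized_dir.

    Hypothetical helper: files already present in factorized_dir (for example the
    rewritten root, listPerson, listOrg and taxonomy files) are left untouched.
    Returns the relative paths of the files that were linked.
    """
    source_dir = Path(source_dir)
    factorized_dir = Path(factorized_dir)
    factorized_dir.mkdir(parents=True, exist_ok=True)

    linked = []
    for src in sorted(source_dir.rglob("*.xml")):
        rel = src.relative_to(source_dir)
        dst = factorized_dir / rel
        if dst.exists() or dst.is_symlink():
            continue  # keep the factorized version of this file
        dst.parent.mkdir(parents=True, exist_ok=True)
        # Link to the absolute path so the link stays valid from any working directory.
        os.symlink(src.resolve(), dst)
        linked.append(rel.as_posix())
    return linked
```

Whether this is usable depends on the distro script following symlinks when it reads or packs the folder, which is not confirmed here; if it does not, the links would have to be replaced by real copies at packing time.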

TomazErjavec commented 1 year ago

Thank you @matyaskopp. On this basis I have now added, in 794d629, parlamint-factorize-corpora.pl, which factorised all the submitted corpora; its output can then serve as input to the distribution script. So, all done here, closing.