clarin-eric / ParlaMint

ParlaMint: Comparable Parliamentary Corpora
https://clarin-eric.github.io/ParlaMint/

Factorisation: .ana is everywhere #688

Closed by TomazErjavec 1 year ago

TomazErjavec commented 1 year ago

Running the modified factorisation script, I just noticed a problem: when factorising a non-factorised .ana corpus root, all the factorised files get the .ana infix (not only the file names, but their IDs and the XIncludes in the corpus root as well):

  -rw-rw-r--   1 tomaz tomaz 100965 Jun 10 12:39 ParlaMint-BA.ana.xml
  -rw-rw-r--   1 tomaz tomaz  15829 Jun 10 12:39 ParlaMint-BA-listOrg.ana.xml
  -rw-rw-r--   1 tomaz tomaz 294071 Jun 10 12:39 ParlaMint-BA-listPerson.ana.xml
  -rw-rw-r--   1 tomaz tomaz    726 Jun 10 12:39 ParlaMint-taxonomy-NER.ana.xml
  -rw-rw-r--   1 tomaz tomaz  10238 Jun 10 12:39 ParlaMint-taxonomy-parla.legislature.ana.xml
  -rw-rw-r--   1 tomaz tomaz   1173 Jun 10 12:39 ParlaMint-taxonomy-speaker_types.ana.xml
  -rw-rw-r--   1 tomaz tomaz    910 Jun 10 12:39 ParlaMint-taxonomy-subcorpus.ana.xml
  -rw-rw-r--   1 tomaz tomaz   6320 Jun 10 12:39 ParlaMint-taxonomy-UD-SYN.ana.xml

Is this a bug in factorise or am I doing something wrong?

matyaskopp commented 1 year ago

@Tomaž, fixed. The error was in both scripts, mine and yours. Factorisation of the TEI.ana version needs the TEI corpus root as a parameter; the .ana interfix is added if the file inclusion is not present in the TEI version.
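
The naming rule described in this comment can be illustrated with a minimal Python sketch. This is not the actual factorisation script: the function names `included_hrefs` and `factorised_name` are hypothetical, and the sketch assumes the TEI corpus root lists its components as `xi:include` elements with an `href` attribute.

```python
import xml.etree.ElementTree as ET

# Standard XInclude namespace
XI_NS = "http://www.w3.org/2001/XInclude"


def included_hrefs(root_xml: str) -> set[str]:
    """Collect the href values of all xi:include elements in a corpus root."""
    root = ET.fromstring(root_xml)
    return {
        el.get("href")
        for el in root.iter(f"{{{XI_NS}}}include")
        if el.get("href")
    }


def factorised_name(base: str, tei_hrefs: set[str]) -> str:
    """Choose the file name for a component factorised out of the .ana root.

    base is the name without extension, e.g. "ParlaMint-BA-listOrg".
    (Hypothetical helper, illustrating the rule only.)
    """
    plain = f"{base}.xml"
    # The TEI version already includes this file: reuse the plain name.
    if plain in tei_hrefs:
        return plain
    # The file exists only in the .ana version: add the .ana interfix.
    return f"{base}.ana.xml"
```

With a TEI root that includes `ParlaMint-BA-listOrg.xml`, `factorised_name("ParlaMint-BA-listOrg", hrefs)` keeps the plain name, while a component missing from the TEI root gets `.ana.xml`.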

TomazErjavec commented 1 year ago

Thanks @matyaskopp, looking good. Now started the build for 3.0.