clarin-eric / ParlaMint

ParlaMint: Comparable Parliamentary Corpora
https://clarin-eric.github.io/ParlaMint/

invalid subcorpus in meta and vert files #707

Closed: matyaskopp closed this issue 1 year ago

matyaskopp commented 1 year ago

This is probably already fixed, because the English version contains the correct subcorpora:

Every component in the War subcorpus should also be in the COVID subcorpus.

it is OK in TEI: https://github.com/clarin-eric/ParlaMint/blob/535dae3f802d20ea053e76899ddcf6ab805049c0/Data/ParlaMint-AT/ParlaMint-AT_2022-05-19-027-XXVII-NRSITZ-00159.xml#L4 and TEI.ana: https://github.com/clarin-eric/ParlaMint/blob/535dae3f802d20ea053e76899ddcf6ab805049c0/Data/ParlaMint-AT/ParlaMint-AT_2022-05-19-027-XXVII-NRSITZ-00159.ana.xml#L4

but not in meta: https://github.com/clarin-eric/ParlaMint/blob/535dae3f802d20ea053e76899ddcf6ab805049c0/Data/ParlaMint-AT/ParlaMint-AT_2022-05-19-027-XXVII-NRSITZ-00159-meta.tsv?plain=1#L2 and vert: https://github.com/clarin-eric/ParlaMint/blob/535dae3f802d20ea053e76899ddcf6ab805049c0/Data/ParlaMint-AT/ParlaMint-AT_2022-05-19-027-XXVII-NRSITZ-00159.vert#L6
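A mismatch like this could be detected with a small check along the following lines. This is only a sketch under assumptions: the `#reference` / `#covid` / `#war` pointers in the root `ana` attribute, the `Subcorpus` column name, and the comma separator are guesses for illustration and are not taken from the ParlaMint scripts.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical mapping from TEI @ana pointers to the labels used in the TSV.
ANA_TO_LABEL = {"#reference": "Reference", "#covid": "COVID", "#war": "War"}


def tei_subcorpora(tei_xml):
    """Return the set of subcorpus labels found in the root element's @ana."""
    root = ET.fromstring(tei_xml)
    tokens = root.get("ana", "").split()
    return {ANA_TO_LABEL[t] for t in tokens if t in ANA_TO_LABEL}


def meta_subcorpora(meta_tsv, column="Subcorpus", sep=","):
    """Return the set of subcorpus labels found in the given TSV column."""
    reader = csv.DictReader(io.StringIO(meta_tsv), delimiter="\t")
    found = set()
    for row in reader:
        found.update(v.strip() for v in row[column].split(sep) if v.strip())
    return found


def missing_in_meta(tei_xml, meta_tsv):
    """Labels present in the TEI component but absent from its metadata table."""
    return tei_subcorpora(tei_xml) - meta_subcorpora(meta_tsv)
```

Running `missing_in_meta` over each TEI / `-meta.tsv` pair would list the components where the derived file lost a subcorpus.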

TomazErjavec commented 1 year ago

Hm, tricky one, because retaining only one subcorpus is supposed to be a feature: https://github.com/clarin-eric/ParlaMint/blob/535dae3f802d20ea053e76899ddcf6ab805049c0/Scripts/parlamint-lib.xsl#L99-L103.

The idea was that it is easier (i.e. in the concordancer) not to have two subcorpora assigned to a component, because then it is difficult to specify a subcorpus that only contains "War".

Right now the resulting values are wrong in any case, as they are inconsistent, so the vertical files have to be done again. Still, @matyaskopp, shall I keep just one subcorpus value in the tsv / vert, or is it better to have two?
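The trade-off between the two options can be illustrated with a toy model. This is a hedged sketch, not the logic in `parlamint-lib.xsl`: the component records, the `collapse` helper, and its priority order are all hypothetical.

```python
# Toy components; each carries the set of subcorpora it belongs to (TEI view).
components = [
    {"id": "a", "subcorpora": {"Reference"}},
    {"id": "b", "subcorpora": {"COVID"}},
    {"id": "c", "subcorpora": {"COVID", "War"}},
]


def collapse(subcorpora, priority=("War", "COVID", "Reference")):
    """Keep a single label per component (assumed priority order)."""
    for label in priority:
        if label in subcorpora:
            return label
    return None


def select_single(components, wanted):
    """Single-value view: exact match on the collapsed label."""
    return [c["id"] for c in components if collapse(c["subcorpora"]) == wanted]


def select_multi(components, wanted):
    """Multi-value view: membership test on the full label set."""
    return [c["id"] for c in components if wanted in c["subcorpora"]]
```

With a single value, selecting War is a plain equality test, but component `c` disappears from COVID. With multiple values, `c` is found under both labels, at the cost of a membership (or pattern) query instead of an exact match.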

TomazErjavec commented 1 year ago

OK, my solution didn't work anyway, so in 8be490a I now keep the TEI perspective, i.e. a corpus can be both COVID and War.