clarin-eric / ParlaMint

ParlaMint: Comparable Parliamentary Corpora
https://clarin-eric.github.io/ParlaMint/

tagUsage calculation in AT corpus #662

Closed matyaskopp closed 1 year ago

matyaskopp commented 1 year ago

The AT corpus has wrong numbers in tagUsage in the /project/corpora/Parla/ParlaMint/ParlaMint-full/Data/Corpora folder:

All corpus files look like this: https://github.com/clarin-eric/ParlaMint/blob/392e2ee930e764d09045ea0e827de6c57d2afe2c/Data/ParlaMint-AT/ParlaMint-AT.xml#L112-L128

And component files: https://github.com/clarin-eric/ParlaMint/blob/392e2ee930e764d09045ea0e827de6c57d2afe2c/Data/ParlaMint-AT/ParlaMint-AT_2005-03-31-022-XXII-NRSITZ-00100.xml#L110-L126

I guess that the finalization script does not calculate these numbers, and that only AT sets them to 1 in its component files.

matyaskopp commented 1 year ago

Now I see:
https://github.com/clarin-eric/ParlaMint/blob/819add4ccdecad8faac712b22c618002ac76b6e7/Scripts/parlamint2distro.pl#L131
https://github.com/clarin-eric/ParlaMint/blob/819add4ccdecad8faac712b22c618002ac76b6e7/Scripts/parlamint2final.xsl#L21

parlamint2final does not calculate tagUsage.

The tagUsage calculation is implemented in https://github.com/clarin-eric/ParlaMint/blob/819add4ccdecad8faac712b22c618002ac76b6e7/Scripts/parlamint-add-common-content.xsl#L12, which is not used in the finalization.
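
For readers unfamiliar with what a tagUsage calculation involves, here is a minimal Python sketch. It is not the logic of parlamint-add-common-content.xsl, which is XSLT and may count differently. It assumes that tagUsage lists how often each element occurs inside `<text>` (including `<text>` itself), and that the root file's numbers are the sums over its component files. The function names and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

TEI_NS = "http://www.tei-c.org/ns/1.0"


def count_tag_usage(tei_xml: str) -> Counter:
    """Count occurrences of each element inside <text> (assumed scope)."""
    root = ET.fromstring(tei_xml)
    counts = Counter()
    text = root.find(f"{{{TEI_NS}}}text")
    if text is None:
        return counts
    for elem in text.iter():  # iter() yields <text> itself and all descendants
        local_name = elem.tag.rsplit("}", 1)[-1]  # drop the "{namespace}" prefix
        counts[local_name] += 1
    return counts


def merge_counts(per_component):
    """Sum component counts to get the numbers for the corpus root file."""
    total = Counter()
    for counts in per_component:
        total.update(counts)
    return total


def render_tag_usage(counts: Counter) -> str:
    """Serialize counts as tagUsage elements, sorted by element name."""
    return "\n".join(
        f'<tagUsage gi="{gi}" occurs="{n}"/>' for gi, n in sorted(counts.items())
    )


# Hypothetical miniature component file, for illustration only.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><titleStmt><title>Sample</title></titleStmt></fileDesc></teiHeader>
  <text>
    <body>
      <div type="debateSection">
        <u who="#A"><seg>Hello.</seg><seg>World.</seg></u>
        <u who="#B"><seg>Reply.</seg></u>
      </div>
    </body>
  </text>
</TEI>"""

if __name__ == "__main__":
    print(render_tag_usage(count_tag_usage(SAMPLE)))
```

A component file whose tagUsage entries all carry `occurs="1"` would disagree with such a count as soon as any element appears more than once, which is the symptom reported above.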

TomazErjavec commented 1 year ago

I thought everybody computed their tagUsages, but I noticed the AT problem a couple of days ago myself. I have now inserted your calculation into finalize, but it is a doomed effort, because I change the countable markup for ES-GA and now also IS (names without words, a bug which floated to the top only in the MTed corpus), hm. I guess we should do my fixes first, and then just use add-common (although my version of add-common does things yours doesn't :). Would you dare try it, or is that too much to hope for? I'm afraid of introducing even more confusion! Or maybe we live with the fact that tagUsages will be slightly off for 3.0, and hope to do better in 3.1?

TomazErjavec commented 1 year ago

Discussion on this continues in #675, closing this one.