clarin-eric / ParlaMint

ParlaMint: Comparable Parliamentary Corpora
https://clarin-eric.github.io/ParlaMint/

add invalid characters validation #586

Closed matyaskopp closed 1 year ago

matyaskopp commented 1 year ago

We have scripts for character stats: https://github.com/clarin-eric/ParlaMint/blob/5deaeed5ae792f3ba1726072298885b5b64a6d64/Makefile#L217-L232

However, they are not used in the validation procedure.

TODO: extend validate-parlamint.pl with character validation:

NOTE: There is no need to create temporary files. validate-parlamint.pl can contain a list of invalid characters and check that the chars.pl output does not contain any of them.

TomazErjavec commented 1 year ago

I am a bit sceptical that this makes sense now; still, it may be useful for those who have not submitted yet (and for future generations of ParlaMinters). I didn't use chars.pl, as it is designed a bit differently, but just incorporated the relevant code into validate-parlamint.pl. I didn't test it extensively, but I hope it works. TAB can't be forbidden, as it might appear because of XML indentation.
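
For readers who want a feel for this kind of check, here is a minimal sketch in Python. It is not the Perl code in validate-parlamint.pl. The set of characters treated as invalid (`EXTRA_INVALID`, plus the control and private-use categories) and the function name `find_invalid_chars` are assumptions made for illustration. The only detail taken from the comment above is that TAB stays allowed.

```python
import unicodedata

# TAB must stay allowed (XML indentation); LF and CR are assumed to be fine too.
ALLOWED_CONTROLS = {"\t", "\n", "\r"}

# Hypothetical extra characters to flag: soft hyphen, zero-width space,
# byte order mark, replacement character. The real list may differ.
EXTRA_INVALID = {"\u00ad", "\u200b", "\ufeff", "\ufffd"}


def find_invalid_chars(text):
    """Return a dict mapping each invalid character to its count in text."""
    counts = {}
    for ch in text:
        if ch in ALLOWED_CONTROLS:
            continue
        # Cc = control characters, Co = private-use characters.
        if ch in EXTRA_INVALID or unicodedata.category(ch) in ("Cc", "Co"):
            counts[ch] = counts.get(ch, 0) + 1
    return counts
```

Because the check runs on an in-memory string, no temporary files are needed; a caller could pass in the content of each file as it is read.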

TomazErjavec commented 1 year ago

This has now been implemented. ERROR was recently also changed to WARN for bad characters, so validation does not fail if some bad characters are encountered. Closing the issue; if other things need to be discussed in relation to this, we can open a new one.
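
The difference between the two severities can be sketched as follows, again in Python rather than the project's Perl. The function name `validate`, the message format, the default `bad_chars` set, and the empty-input check are all hypothetical; the empty-input check is there only so the sketch has one condition that does fail validation.

```python
def validate(text, bad_chars=frozenset({"\x01", "\ufffd"})):
    """Return (ok, messages); bad characters produce WARN and leave ok unchanged."""
    messages = []
    ok = True
    # Hypothetical hard failure, included only for contrast with the warning.
    if not text.strip():
        messages.append("ERROR: empty input")
        ok = False
    # Report each distinct bad character once, in code point order.
    for ch in sorted({c for c in text if c in bad_chars}):
        messages.append(f"WARN: bad character U+{ord(ch):04X}")
    return ok, messages
```

With this shape, a file containing a bad character is still reported as valid, but the warning remains visible in the output.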