clarin-eric / ParlaMint

ParlaMint: Comparable Parliamentary Corpora
https://clarin-eric.github.io/ParlaMint/

SE: invalid characters #595

Open matyaskopp opened 1 year ago

matyaskopp commented 1 year ago

https://github.com/clarin-eric/ParlaMint/actions/runs/4027956603/jobs/6924304418#step:4:297

 INFO: Char validation for ParlaMint-SE_2016-11-16-prot-201617--29.xml
 ERROR: File ParlaMint-SE_2016-11-16-prot-201617--29.xml contains bad chars: U+F0B7 (3x) 
 INFO: Char validation for ParlaMint-SE_2020-11-04-prot-202021--29.xml
 ERROR: File ParlaMint-SE_2020-11-04-prot-202021--29.xml contains bad chars: U+AD (4x)

The same errors also appear in the TEI.ana version.

documentation: https://clarin-eric.github.io/ParlaMint/#sec-chars
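For anyone wanting to reproduce this kind of check locally, here is a minimal sketch in Python. It is not the validate-parlamint implementation: the function names are hypothetical, and the rule it applies (flag Unicode categories `Co`, private use, and `Cf`, format characters) is an assumption chosen only because it catches the two characters reported above, U+F0B7 and U+AD. The authoritative list of disallowed characters is in the documentation linked above.

```python
import unicodedata
from collections import Counter

# Assumption: treat private-use (Co) and format (Cf) characters as suspect.
# The real ParlaMint rules may differ; see the linked documentation.
SUSPECT_CATEGORIES = {"Co", "Cf"}


def find_suspect_chars(text):
    """Return a dict mapping each suspect character to its occurrence count."""
    counts = Counter(
        ch for ch in text if unicodedata.category(ch) in SUSPECT_CATEGORIES
    )
    return dict(counts)


def format_report(filename, text):
    """Build a one-line report loosely modelled on the log lines above."""
    counts = find_suspect_chars(text)
    if not counts:
        return f"INFO: no suspect chars in {filename}"
    # Sort by code point so the report is stable across runs.
    parts = ", ".join(
        f"U+{ord(ch):X} ({n}x)" for ch, n in sorted(counts.items())
    )
    return f"ERROR: File {filename} contains bad chars: {parts}"


if __name__ == "__main__":
    # Made-up sample text, not taken from the corpus.
    sample = "punkt \uf0b7 ett, sam\u00adman\u00adträde"
    print(format_report("example.xml", sample))
```

To run it on a real file, read the file as UTF-8 and pass its contents to `format_report`. Note that a leading byte order mark (U+FEFF) is also category `Cf` and would be flagged by this sketch.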

TomazErjavec commented 1 year ago

@ninpnin, note that this error showed up because we now incorporate character validity checking in the validate-parlamint script. As you have already finished 3.0, you can address this for 3.1 if you wish. But as we promised that there would be no requirements for changing the content of segments, it is not obligatory. Which doesn't mean we wouldn't be happy to see this corrected, even in 3.0!