Closed: matyaskopp closed this 1 year ago
Yes. I don't know why the transcribers put it there, but they did. Source file: https://data.stortinget.no/eksport/publikasjon?publikasjonid=refs-201819-12-11
We unfortunately don't have capacity to manually check 24 years of parliamentary debates. There are many spelling errors, unconventional use of characters and so on. All the national corpora must be like this. And correcting this particular error would be quite a bit of work, as simply deleting the symbol would surely break the dependency links in the .ana file.
@matyaskopp I would really prefer to leave it as is.
I agree that this character is one of many (though it really looks terrible). But it seems that it is allowed in TEI (@TomazErjavec?)
There is another issue with characters (see documentation: https://clarin-eric.github.io/ParlaMint/#sec-chars). I have run Scripts/chars.pl and Scripts/chars-summ.pl on your ParlaMint-NO.TEI data and filtered for HYPHEN and SPACE characters:
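| Code | Char | Occurs | Occurs % | In docs | In docs % | Unicode name |
|---|---|---|---|---|---|---|
| U+0020 | `<CTRL>` | 156510934 | 18.44 | 3280 | 100.00 | SPACE |
| U+002D | - | 6706215 | 0.79 | 3280 | 100.00 | HYPHEN-MINUS |
| U+2002 | | 154 | 0.00 | 31 | 0.95 | EN SPACE |
| U+2003 | | 175 | 0.00 | 44 | 1.34 | EM SPACE |
| U+2005 | | 5 | 0.00 | 4 | 0.12 | FOUR-PER-EM SPACE |
| U+2006 | | 9 | 0.00 | 8 | 0.24 | SIX-PER-EM SPACE |
| U+2009 | | 779 | 0.00 | 155 | 4.73 | THIN SPACE |
| U+200A | | 6 | 0.00 | 3 | 0.09 | HAIR SPACE |
| U+2011 | ‑ | 73 | 0.00 | 12 | 0.37 | NON-BREAKING HYPHEN |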
U+0020 and U+002D are OK, but the rest should be replaced.
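For anyone who wants to produce a similar tally without the Perl scripts, here is a minimal Python sketch. It is not a port of Scripts/chars.pl; it simply counts characters whose Unicode general category is `Zs` (space separator) or `Pd` (dash punctuation), which is an assumed stand-in for the HYPHEN and SPACE filter above. The function name `space_and_hyphen_rows` and the idea of passing documents as in-memory strings are hypothetical.

```python
import unicodedata
from collections import Counter


def space_and_hyphen_rows(texts):
    """Return (code, char, occurrences, docs, name) rows for space/dash characters.

    `texts` is an iterable of strings, one string per document.
    """
    occurrences = Counter()  # total count of each character
    in_docs = Counter()      # number of documents containing each character
    for text in texts:
        occurrences.update(text)
        in_docs.update(set(text))

    rows = []
    for ch in sorted(occurrences):
        # Assumption: Zs (space separators) and Pd (dash punctuation)
        # approximate the SPACE and HYPHEN filter used for the table.
        if unicodedata.category(ch) in ("Zs", "Pd"):
            rows.append((f"U+{ord(ch):04X}", ch, occurrences[ch],
                         in_docs[ch], unicodedata.name(ch, "<unnamed>")))
    return rows


if __name__ == "__main__":
    sample = ["a\u2009b-c", "d e\u2011f"]  # made-up sample, not corpus data
    for row in space_and_hyphen_rows(sample):
        print(row)
```

Any row other than U+0020 and U+002D would then be a candidate for replacement.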
@matyaskopp If these are illegal, why are they not mentioned in the list of illegal characters, and why were they not checked for in validation?
We simply do not have the capacity to redo the corpus now. If this was an issue, we really would have needed to know about this before.
OK, I see some of them are among the illegal chars. Unfortunate. They should have been normalized. Normalizing the nonstandard spaces to U+0020 and the U+2011 to U+002D should be quickly done. Hopefully it won't break the .ana docs. But I don't know if we are able to deal with the replacement character. It could be normalized maybe, but to what... Anyway, it seems it is not an illegal character at least.
I ran a simple replace script on the corpus. That should do it for the whitespace and hyphen.
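The script itself is not shown in this thread, so the following is only a sketch of what such a replacement could look like in Python. The mapping covers just the characters listed in the table earlier in the thread (the nonstandard spaces to U+0020, and U+2011 to U+002D); the names `CHAR_MAP` and `normalize_chars` are hypothetical.

```python
# Assumed mapping, based only on the characters named in this thread.
NONSTANDARD_SPACES = "\u2002\u2003\u2005\u2006\u2009\u200a"

CHAR_MAP = {ord(ch): " " for ch in NONSTANDARD_SPACES}  # -> U+0020 SPACE
CHAR_MAP[0x2011] = "-"  # NON-BREAKING HYPHEN -> U+002D HYPHEN-MINUS


def normalize_chars(text: str) -> str:
    """Replace each mapped character with its plain ASCII counterpart."""
    return text.translate(CHAR_MAP)


if __name__ == "__main__":
    print(normalize_chars("a\u2009b\u2011c"))
```

Because every replacement is one character for one character, the text length does not change. That does not by itself guarantee that the .ana files stay consistent, so those would still need to be validated afterwards.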
I agree that the validation is far from complete (I have added issue #586); there are many documented features that are not validated, but the documentation is quite clear.
Thanks, @TomazErjavec, can you update the NO corpus, please?
Already did it but forgot to let you know, sorry! The log is at https://nl.ijs.si/et/tmp/ParlaMint/Repo/ParlaMint-NO.log and the corpus is also on the beta concordancer: https://www.clarin.si/noske-beta/parlamint30.cgi/corp_info?corpname=parlamint30_no&struct_attr_stats=1&subcorpora=1
Looks ok to me, except the (already seen) CoNLL-U parse problems.
@tungland, do you plan to fix the remaining char issues for 3.1 (I have put that milestone on this issue), or not, in which case we close it?
I think I already submitted a corpus without these? It was way back.
Edit: yes, see this comment https://github.com/clarin-eric/ParlaMint/issues/583#issuecomment-1387256829
OK, great, so, closing.
Documents ParlaMint-NO_2018-12-10.xml and ParlaMint-NO_2018-12-10.ana.xml contain crazy characters: Beginning of the file